Series Preface


Mathematics is playing an ever more important role in the physical and biolog- 
ical sciences, provoking a blurring of boundaries between scientific disciplines 
and a resurgence of interest in the modern as well as the classical techniques 
of applied mathematics. This renewal of interest, both in research and teach- 
ing, has led to the establishment of the series Texts in Applied Mathematics 
(TAM). 

The development of new courses is a natural consequence of a high level 
of excitement on the research frontier as newer techniques, such as numerical 
and symbolic computer systems, dynamical systems, and chaos, mix with and 
reinforce the traditional methods of applied mathematics. Thus, the purpose 
of this textbook series is to meet the current and future needs of these advances 
and to encourage the teaching of new courses. 

TAM will publish textbooks suitable for use in advanced undergraduate 
and beginning graduate courses, and will complement the Applied Mathe- 
matical Sciences (AMS) series, which will focus on advanced textbooks and 
research-level monographs. 


Pasadena, California J.E. Marsden 
Providence, Rhode Island L. Sirovich 
College Park, Maryland S.S. Antman 


Preface 


The origin of this textbook is a course on numerical linear algebra that we 
taught to third-year undergraduate students at Université Pierre et Marie 
Curie (Paris 6 University). Numerical linear algebra is the intersection of 
numerical analysis and linear algebra and, more precisely, focuses on practical 
algorithms for solving problems of linear algebra on a computer.

Indeed, most numerical computations in all fields of applications, such as
physics, mechanics, chemistry, and finance, involve numerical linear algebra.
All in all, these numerical simulations boil down to a series of matrix com- 
putations. There are mainly two types of matrix computations: solving linear 
systems of equations and computing eigenvalues and eigenvectors of matrices. 
Of course, there are other important problems in linear algebra, but these two 
are predominant and will be studied in great detail in this book. 

From a theoretical point of view, these two questions are by now com- 
pletely understood and solved. Necessary and sufficient conditions for the 
existence and/or uniqueness of solutions to linear systems are well known, as 
well as criteria for diagonalizing matrices. However, the steady and impres- 
sive progress of computer power has changed those theoretical questions into 
practical issues. An applied mathematician cannot be satisfied by a mere exis- 
tence theorem and rather asks for an algorithm, i.e., a method for computing 
unknown solutions. Such an algorithm must be efficient: it must neither take
too long to run nor use too much memory on a computer. It must also be stable,
that is, small errors in the data should produce similarly small errors in the 
output. Recall that errors cannot be avoided, because of rounding off in the 
computer. These two requirements, efficiency and stability, are key issues in 
numerical analysis. Many apparently simple algorithms are rejected because 
of them. 

This book is intended for advanced undergraduate students who have al- 
ready been exposed to linear algebra (for instance, [9], [10], [16]). Nevertheless, 
to be as self-contained as possible, its second chapter recalls the necessary de- 
finitions and results of linear algebra that will be used in the sequel. On the 
other hand, our purpose is to be introductory concerning numerical analysis,
for which we do not ask for any prerequisite. Therefore, we do not claim to
be exhaustive, nor do we systematically give the most efficient or recent
algorithms if they are too complicated. We leave this task to other books at
the graduate level, such as [2], [7], [11], [12], [14], [17], [18]. For
pedagogical reasons we content ourselves with giving the simplest and most
illustrative algorithms.

Since the inception of computers and, all the more, the development 
of simple and user-friendly software such as Maple, Mathematica, Matlab, 
Octave, and Scilab, mathematics has become a truly experimental science like 
physics or mechanics. It is now possible and very easy to perform numerical 
experiments on a computer that help in increasing intuition, checking conjec- 
tures or theorems, and quantifying the effectiveness of a method. One original 
feature of this book is to follow an experimental approach in all exercises. The 
reader should use Matlab for solving these exercises, which are given at the 
end of each chapter. For some of them, marked by a (*), complete solutions, 
including Matlab scripts, are given in the last chapter. The solutions of the
other exercises are given in a solution manual available to teachers and
professors on request to Springer. The original French version of this book (see
our web page http://www.ann.jussieu.fr/numalgebra) used Scilab, which
is probably less popular than Matlab but has the advantage of being free
software (see http://www.scilab.org). Finally we thank Karim Trabelsi for
translating a large part of this book from an earlier French version.
We hope the reader will enjoy more mathematics by seeing it “in practice.” 


G.A., S.M.K. 
Paris 


Contents


1 Introduction
   1.1 Discretization of a Differential Equation
   1.2 Least Squares Fitting
   1.3 Vibrations of a Mechanical System
   1.4 The Vibrating String
   1.5 Image Compression by the SVD Factorization

2 Definition and Properties of Matrices
   2.1 Gram-Schmidt Orthonormalization Process
   2.2 Matrices
      2.2.1 Trace and Determinant
      2.2.2 Special Matrices
      2.2.3 Rows and Columns
      2.2.4 Row and Column Permutation
      2.2.5 Block Matrices
   2.3 Spectral Theory of Matrices
   2.4 Matrix Triangularization
   2.5 Matrix Diagonalization
   2.6 Min-Max Principle
   2.7 Singular Values of a Matrix
   2.8 Exercises

3 Matrix Norms, Sequences, and Series
   3.1 Matrix Norms and Subordinate Norms
   3.2 Subordinate Norms for Rectangular Matrices
   3.3 Matrix Sequences and Series
   3.4 Exercises

4 Introduction to Algorithmics
   4.1 Algorithms and pseudolanguage
   4.2 Operation Count and Complexity
   4.3 The Strassen Algorithm
   4.4 Equivalence of Operations
   4.5 Exercises

5 Linear Systems
   5.1 Square Linear Systems
   5.2 Over- and Underdetermined Linear Systems
   5.3 Numerical Solution
      5.3.1 Floating-Point System
      5.3.2 Matrix Conditioning
      5.3.3 Conditioning of a Finite Difference Matrix
      5.3.4 Approximation of the Condition Number
      5.3.5 Preconditioning
   5.4 Exercises

6 Direct Methods for Linear Systems
   6.1 Gaussian Elimination Method
   6.2 LU Decomposition Method
      6.2.1 Practical Computation of the LU Factorization
      6.2.2 Numerical Algorithm
      6.2.3 Operation Count
      6.2.4 The Case of Band Matrices
   6.3 Cholesky Method
      6.3.1 Practical Computation of the Cholesky Factorization
      6.3.2 Numerical Algorithm
      6.3.3 Operation Count
   6.4 QR Factorization Method
      6.4.1 Operation Count
   6.5 Exercises

7 Least Squares Problems
   7.1 Motivation
   7.2 Main Results
   7.3 Numerical Algorithms
      7.3.1 Conditioning of Least Squares Problems
      7.3.2 Normal Equation Method
      7.3.3 QR Factorization Method
      7.3.4 Householder Algorithm
   7.4 Exercises

8 Simple Iterative Methods
   8.1 General Setting
   8.2 Jacobi, Gauss-Seidel, and Relaxation Methods
      8.2.1 Jacobi Method
      8.2.2 Gauss-Seidel Method
      8.2.3 Successive Overrelaxation Method (SOR)
   8.3 The Special Case of Tridiagonal Matrices
   8.4 Discrete Laplacian
   8.5 Programming Iterative Methods
   8.6 Block Methods
   8.7 Exercises

9 Conjugate Gradient Method
   9.1 The Gradient Method
   9.2 Geometric Interpretation
   9.3 Some Ideas for Further Generalizations
   9.4 Theoretical Definition of the Conjugate Gradient Method
   9.5 Conjugate Gradient Algorithm
      9.5.1 Numerical Algorithm
      9.5.2 Number of Operations
      9.5.3 Convergence Speed
      9.5.4 Preconditioning
      9.5.5 Chebyshev Polynomials
   9.6 Exercises

10 Methods for Computing Eigenvalues
   10.1 Generalities
   10.2 Conditioning
   10.3 Power Method
   10.4 Jacobi Method
   10.5 Givens-Householder Method
   10.6 QR Method
   10.7 Lanczos Method
   10.8 Exercises

11 Solutions and Programs
   11.1 Exercises of Chapter 2
   11.2 Exercises of Chapter 3
   11.3 Exercises of Chapter 4
   11.4 Exercises of Chapter 5
   11.5 Exercises of Chapter 6
   11.6 Exercises of Chapter 7
   11.7 Exercises of Chapter 8
   11.8 Exercises of Chapter 9
   11.9 Exercises of Chapter 10

References

Index


1 


Introduction 


As we said in the preface, linear algebra is everywhere in numerical
simulations, often well hidden from the average user, but always crucial in terms
of performance and efficiency. Almost all numerical computations in physics, 
mechanics, chemistry, engineering, economics, finance, etc., involve numerical 
linear algebra, i.e., computations involving matrices. The purpose of this in- 
troduction is to give a few examples of applications of the two main types 
of matrix computations: solving linear systems of equations on the one hand, 
and computing eigenvalues and eigenvectors on the other hand. The following 
examples serve as a motivation for the main notions, methods, and algorithms 
discussed in this book. 


1.1 Discretization of a Differential Equation 


We first give a typical example of a mathematical problem whose solution is 
determined by solving a linear system of large size. This example (which can 
be generalized to many problems all of which have extremely important appli- 
cations) is linked to the approximate numerical solution of partial differential 
equations. A partial differential equation is a differential equation in several 
variables (hence the use of partial derivatives). However, for simplicity, we 
shall confine our exposition to the case of a single space variable x, and to 
real-valued functions. 

A great number of physical phenomena are modeled by the following equa- 
tion, the so-called Laplace, Poisson, or conduction equation (it is also a time- 
independent version of the heat or wave equation; for more details, see [3]): 


\[
\begin{cases}
-u''(x) + c(x)\,u(x) = f(x) & \text{for all } x \in (0,1), \\
u(0) = \alpha, \quad u(1) = \beta,
\end{cases}
\tag{1.1}
\]


where $\alpha$, $\beta$ are two real numbers, $c(x)$ is a nonnegative and continuous
function on [0,1], and $f(x)$ is a continuous function on [0,1]. This second-order
differential equation with its boundary conditions “at both ends” is called a
boundary value problem. We shall admit the existence and uniqueness of a
solution $u(x)$ of class $C^2$ on [0,1] of the boundary value problem (1.1). If $c(x)$
is constant, one can find an explicit formula for the solution of (1.1). However,
in higher spatial dimensions or for more complex boundary value problems
(varying $c(x)$, nonlinear equations, systems of equations, etc.), there is usually
no explicit solution. Therefore, the only possibility is to approximate the
solution numerically. The aim of example (1.1) is precisely to show a method
of discretization and computation of approximate solutions that is called the
finite difference method. We call discretization of a differential equation the
formulation of an approximate problem for which the unknown is no longer a
function but a finite (discrete) collection of approximate values of this
function. The finite difference method, which is very simple in dimension 1, can
easily be generalized, at least in principle, to a wide class of boundary value
problems.

In order to compute numerically an approximation of the solution of (1.1),
we divide the interval [0, 1] into n equal subintervals (i.e., of size 1/n), where n
is an integer chosen according to the required accuracy (the larger the value of
n, the “closer” the approximate solution will be to the exact one). We denote
by $x_i$ the (n + 1) endpoints of these intervals:

\[
x_i = \frac{i}{n}, \qquad 0 \le i \le n.
\]

We call $c_i$ the value of $c(x_i)$, $f_i$ the value of $f(x_i)$, and $u_i$ the approximate
value of the solution $u(x_i)$. To compute these approximate values $(u_i)_{0<i<n}$,
we replace the differential equation (1.1) by a system of $n-1$ algebraic
equations. The main idea is to write the differential equation at each point $x_i$
and to replace the second derivative by an appropriate linear combination of
the unknowns $u_i$. To do so, we use Taylor’s formula by assuming that $u(x)$ is
four times continuously differentiable:

\[
\begin{aligned}
u(x_{i+1}) &= u(x_i) + \frac{1}{n}u'(x_i) + \frac{1}{2n^2}u''(x_i)
  + \frac{1}{6n^3}u'''(x_i) + \frac{1}{24n^4}u''''\Big(x_i + \frac{\theta^+}{n}\Big), \\
u(x_{i-1}) &= u(x_i) - \frac{1}{n}u'(x_i) + \frac{1}{2n^2}u''(x_i)
  - \frac{1}{6n^3}u'''(x_i) + \frac{1}{24n^4}u''''\Big(x_i - \frac{\theta^-}{n}\Big),
\end{aligned}
\]

where $\theta^-, \theta^+ \in (0,1)$. Adding these two equations, we get

\[
-u''(x_i) = \frac{2u(x_i) - u(x_{i-1}) - u(x_{i+1})}{n^{-2}} + O(n^{-2}).
\]

Neglecting the remainder term of order $n^{-2}$ yields a “finite difference”
formula, or discrete derivative

\[
-u''(x_i) \approx \frac{2u(x_i) - u(x_{i-1}) - u(x_{i+1})}{n^{-2}}.
\]

Replacing $-u''(x_i)$ by its discrete approximation in the differential
equation (1.1), we get $n-1$ algebraic equations

\[
\frac{2u_i - u_{i-1} - u_{i+1}}{n^{-2}} + c_i u_i = f_i, \qquad 1 \le i \le n-1,
\]

completed by the two boundary conditions

\[
u_0 = \alpha, \qquad u_n = \beta.
\]


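Since the dependence on $u_i$ is linear in these equations, we obtain a so-called
linear system (see Chapter 5) of size $n-1$:

\[
A_n u^{(n)} = b^{(n)}, \tag{1.2}
\]

where $u^{(n)}$ is the vector of entries $(u_1, \dots, u_{n-1})$, while $b^{(n)}$ is a vector and
$A_n$ a matrix defined by

\[
A_n = n^2
\begin{pmatrix}
2 + \frac{c_1}{n^2} & -1 & 0 & \cdots & 0 \\
-1 & 2 + \frac{c_2}{n^2} & \ddots & \ddots & \vdots \\
0 & \ddots & \ddots & -1 & 0 \\
\vdots & \ddots & -1 & 2 + \frac{c_{n-2}}{n^2} & -1 \\
0 & \cdots & 0 & -1 & 2 + \frac{c_{n-1}}{n^2}
\end{pmatrix},
\qquad
b^{(n)} =
\begin{pmatrix}
f_1 + \alpha n^2 \\ f_2 \\ \vdots \\ f_{n-2} \\ f_{n-1} + \beta n^2
\end{pmatrix}.
\]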


The matrix $A_n$ is said to be tridiagonal, since it has nonzero entries only on
its main diagonal and on its two closest diagonals (the subdiagonal and the
superdiagonal).

One can prove (see Lemma 5.3.2) that the matrix $A_n$ is invertible, so that
there exists a unique solution $u^{(n)}$ of the linear system (1.2). Even more, it
is possible to prove that the solution of the linear system (1.2) is a correct
approximation of the exact solution of the boundary value problem (1.1).
We say that the previous finite difference method converges as the number
of intervals n increases. This is actually a delicate result (see, e.g., [3] for a
proof), and we content ourselves with stating it without proof.


Theorem 1.1.1. Assume that the solution $u(x)$ of (1.1) is of class $C^4$ on
[0, 1]. Then the finite difference method converges in the sense that

\[
\max_{0 \le i \le n} \left| u_i^{(n)} - u(x_i) \right|
\le \frac{1}{96\,n^2} \sup_{0 \le x \le 1} |u''''(x)|.
\]


Figure 1.1 shows the exact and approximate (n = 20) solutions of equation
(1.1) on (0, 1), where the functions c and f are chosen so that the exact solution
is $u(x) = x \sin(2\pi x)$:

\[
c(x) = 4, \qquad f(x) = 4(\pi^2 + 1)\,u(x) - 4\pi \cos(2\pi x), \qquad \alpha = \beta = 0.
\]


The problem of solving the differential equation (1.1) has thus been reduced
to solving a linear system. In practice, these linear systems are very large.


Fig. 1.1. Computation of an approximate solution of (1.1) and comparison with 
the exact solution. 


Indeed, most physical phenomena are three-dimensional (unlike our simpli- 
fied example in one spatial dimension). A numerical simulation requires the 
discretization of a three-dimensional domain. For instance, if we decide to 
place 100 discretization points in each spatial direction, the total number of 
points, or unknowns, is 1 million ($100^3$), and hence the linear system to be
solved is of size 1 million. This is a typical size for such a system even if some 
are smaller...or larger. In practice, one needs to have at one’s disposal high- 
performance algorithms for solving such linear systems, that is, fast algorithms 
that require little memory storage and feature the highest accuracy possible. 
This last point is a delicate challenge because of the inevitable rounding er- 
rors (an issue that will be discussed in Section 5.3.1). Solving linear systems 
efficiently is the topic of Chapters 5 to 9. 
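
As an illustration, the finite difference method of this section can be
sketched in a few lines of Python. This is only a minimal sketch under stated
assumptions, not the program used for Figure 1.1: the function names are
hypothetical, and the tridiagonal system (1.2) is solved by a plain elimination
without pivoting, which we assume to be adequate here because c(x) = 4 > 0
makes the matrix strictly diagonally dominant. The data are those of the test
case above (exact solution x sin(2πx), α = β = 0).

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    lower[i-1] is the entry (i, i-1), upper[i-1] the entry (i-1, i),
    and diag[i] the entry (i, i). The inputs are left unmodified.
    """
    size = len(diag)
    d = list(diag)
    r = list(rhs)
    # Forward elimination: remove the subdiagonal.
    for i in range(1, size):
        factor = lower[i - 1] / d[i - 1]
        d[i] -= factor * upper[i - 1]
        r[i] -= factor * r[i - 1]
    # Back substitution.
    x = [0.0] * size
    x[-1] = r[-1] / d[-1]
    for i in range(size - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

def finite_difference_solution(n, c, f, alpha, beta):
    """Return the grid x_0..x_n and the values u_0..u_n solving (1.2), n >= 2."""
    xs = [i / n for i in range(n + 1)]
    n2 = float(n * n)
    diag = [2.0 * n2 + c(x) for x in xs[1:n]]   # n^2 * (2 + c_i / n^2)
    off = [-n2] * (n - 2)                        # sub- and superdiagonal
    rhs = [f(x) for x in xs[1:n]]
    rhs[0] += alpha * n2                         # boundary value u_0
    rhs[-1] += beta * n2                         # boundary value u_n
    inner = solve_tridiagonal(off, diag, off, rhs)
    return xs, [alpha] + inner + [beta]

def exact(x):
    return x * math.sin(2.0 * math.pi * x)

def c(x):
    return 4.0

def f(x):
    return 4.0 * (math.pi ** 2 + 1.0) * exact(x) - 4.0 * math.pi * math.cos(2.0 * math.pi * x)

def max_error(n):
    """Largest gap between the discrete and the exact solution on the grid."""
    xs, us = finite_difference_solution(n, c, f, 0.0, 0.0)
    return max(abs(u - exact(x)) for x, u in zip(xs, us))

if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        print(n, max_error(n))
```

Running the loop for increasing n is a simple way to observe the convergence
stated in Theorem 1.1.1: the printed error should shrink as n grows.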


1.2 Least Squares Fitting 


We now consider a data analysis problem. Assume that during a physical
experiment, we measure a magnitude or quantity y that depends on a real
parameter t. We carry out m experiments and different measurements by
varying the parameter t. The problem is to find a way of deducing from
these m measurements a very simple experimental law that enables one to
approximate as well as possible the studied quantity as a polynomial of degree
at most $n-1$ in the parameter. Let us remark that the form of the
experimental law is imposed (it is polynomial), but the coefficients of this
polynomial are unknown. In general, the number of measurements m is very
large with respect to the number n of coefficients of the sought-after
polynomial. In other words, given m values of the parameter $(t_i)_{i=1}^m$ and m
corresponding values of the measures $(y_i)_{i=1}^m$, we look for a polynomial
$p \in \mathbb{P}_{n-1}$, the set of polynomials of one variable t of degree less than or
equal to $n-1$, that minimizes the error between the experimental value $y_i$
and the predicted theoretical value $p(t_i)$. Here, the error is measured in the
sense of “least squares fitting,” that is, we minimize the sum of the squares
of the individual errors, namely

\[
E = \sum_{i=1}^{m} |y_i - p(t_i)|^2. \tag{1.3}
\]


We write p in a basis $(\varphi_j)_{j=0}^{n-1}$ of $\mathbb{P}_{n-1}$:

\[
p(t) = \sum_{j=0}^{n-1} a_j \varphi_j(t). \tag{1.4}
\]

The quantity E defined by (1.3) is a quadratic function of the n coefficients
$a_j$, since $p(t_i)$ depends linearly on these coefficients. In conclusion, we have to
minimize the function E with respect to the n variables $(a_0, a_1, \dots, a_{n-1})$.


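Linear regression. Let us first study the case n = 2, which comes down to
looking for a straight line to approximate the experimental values. This line is
called the “least squares fitting” line, or linear regression line. In this case, we
choose a basis $\varphi_0(t) = 1$, $\varphi_1(t) = t$, and we set $p(t) = a_0 + a_1 t$. The quantity
to minimize is reduced to

\[
\begin{aligned}
E(a_0, a_1) &= \sum_{i=1}^{m} |y_i - (a_0 + a_1 t_i)|^2 \\
&= A a_1^2 + B a_0^2 + 2C a_0 a_1 + 2D a_1 + 2E a_0 + F,
\end{aligned}
\]

with

\[
A = \sum_{i=1}^{m} t_i^2, \quad B = m, \quad C = \sum_{i=1}^{m} t_i, \quad
D = -\sum_{i=1}^{m} t_i y_i, \quad E = -\sum_{i=1}^{m} y_i, \quad F = \sum_{i=1}^{m} y_i^2.
\]

(On the right-hand sides, E written without arguments denotes this constant,
not the function $E(a_0, a_1)$.) Noting that A > 0, we can factor $E(a_0, a_1)$ as

\[
E(a_0, a_1) = A\Big(a_1 + \frac{C}{A}a_0 + \frac{D}{A}\Big)^2
+ \Big(B - \frac{C^2}{A}\Big)a_0^2 + 2\Big(E - \frac{CD}{A}\Big)a_0 + F - \frac{D^2}{A}.
\]

The coefficient of $a_0^2$ is also positive if the values of the parameter $t_i$ are not
all equal, which is assumed henceforth:

\[
AB - C^2 = m\Big(\sum_{i=1}^{m} t_i^2\Big) - \Big(\sum_{i=1}^{m} t_i\Big)^2
= m \sum_{i=1}^{m} \Big(t_i - \frac{1}{m}\sum_{j=1}^{m} t_j\Big)^2.
\]

We can thus rewrite $E(a_0, a_1)$ as follows:

\[
E(a_0, a_1) = A\Big(a_1 + \frac{C}{A}a_0 + \frac{D}{A}\Big)^2
+ \Delta\Big(a_0 + \frac{E - \frac{CD}{A}}{\Delta}\Big)^2 + G, \tag{1.5}
\]

where $\Delta = B - \frac{C^2}{A}$ and $G = F - \frac{D^2}{A} - \frac{(E - \frac{CD}{A})^2}{\Delta}$. The function E then has a
unique minimum point given by

\[
a_0 = -\frac{E - \frac{CD}{A}}{\Delta}, \qquad a_1 = -\frac{C}{A}a_0 - \frac{D}{A}.
\]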

Example 1.2.1. Table 1.1 (source I.N.S.E.E.) gives data on the evolution of
the construction cost index taken in the first quarter of every year, from
1990 to 2000.


Year  | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000
Index |  939 |  972 | 1006 | 1022 | 1016 | 1011 | 1038 | 1047 | 1058 | 1071 | 1083


Table 1.1. Evolution of the construction cost index.


Figure 1.2 displays the least squares fitting line of best approximation for 
Table 1.1, that is, a line “deviating the least” from the cloud of given points. 
This example (n = 2) is simple enough that we can solve it exactly “by hand.”

Fig. 1.2. Approximation of the data of Table 1.1 by a line.
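
The closed-form minimum of the linear regression case can be checked
numerically. The following Python lines are a minimal sketch, not the authors'
program: the function name `regression_line` is hypothetical, the constants A to
E follow the definitions given above for the case n = 2, and the data are those
of Table 1.1.

```python
def regression_line(ts, ys):
    """Return (a0, a1) minimizing sum |y_i - (a0 + a1 t_i)|^2.

    Uses the constants A, B, C, D, E of the text; the t_i must not all be equal.
    """
    m = len(ts)
    A = sum(t * t for t in ts)
    B = float(m)
    C = sum(ts)
    D = -sum(t * y for t, y in zip(ts, ys))
    E = -sum(ys)
    delta = B - C * C / A          # positive when the t_i are not all equal
    a0 = -(E - C * D / A) / delta
    a1 = -(C / A) * a0 - D / A
    return a0, a1

# Data of Table 1.1: year and construction cost index.
years = list(range(1990, 2001))
index = [939, 972, 1006, 1022, 1016, 1011, 1038, 1047, 1058, 1071, 1083]

a0, a1 = regression_line(years, index)
print("intercept a0 =", a0)
print("slope a1 =", a1)
```

The slope a1 is the average yearly increase of the index suggested by the
fitted line; a0 is the (not very meaningful) extrapolated value at year 0.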


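Polynomial regression. We return now to the general case $n \ge 2$. We denote
by $b \in \mathbb{R}^m$ the vector of measurements, by $c \in \mathbb{R}^m$ that of parameters, by
$x \in \mathbb{R}^n$ that of unknowns, and by $q(t) \in \mathbb{R}^m$ that of the predictions by the
polynomial p:

\[
b = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \quad
c = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_m \end{pmatrix}, \quad
x = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \end{pmatrix}, \quad
q(t) = \begin{pmatrix} p(t_1) \\ p(t_2) \\ \vdots \\ p(t_m) \end{pmatrix}.
\]

As already observed, the polynomial p, and accordingly the vector $q(t)$, depend
linearly on x (its coefficients in the basis of $\varphi_j(t)$). Namely $q(t) = Ax$ and
$E = \|Ax - b\|^2$ with

\[
A = \begin{pmatrix}
\varphi_0(t_1) & \varphi_1(t_1) & \cdots & \varphi_{n-1}(t_1) \\
\varphi_0(t_2) & \varphi_1(t_2) & \cdots & \varphi_{n-1}(t_2) \\
\vdots & \vdots & & \vdots \\
\varphi_0(t_m) & \varphi_1(t_m) & \cdots & \varphi_{n-1}(t_m)
\end{pmatrix},
\]

where $\|\cdot\|$ denotes here the Euclidean norm of $\mathbb{R}^m$. The least squares fitting
problem then reads: find $x \in \mathbb{R}^n$ such that

\[
\|Ax - b\| = \inf_{u \in \mathbb{R}^n} \|Au - b\|. \tag{1.6}
\]

We shall prove (see Chapter 7) that $x \in \mathbb{R}^n$ is a solution of the minimization
problem (1.6) if and only if x is a solution of the so-called normal equations

\[
A^* A x = A^* b. \tag{1.7}
\]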
The solutions of the least squares fitting problem are therefore given by solving 
either the linear system (1.7) or the minimization problem (1.6). Figure 1.3 


shows the solution of this equation for m = 11, n = 5, and the data y; of 
Table 1.1. 
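
The polynomial fit just described can be sketched in Python as follows. This
is a hedged illustration, not the program behind Figure 1.3: the function names
are hypothetical, the basis is assumed to be the monomials 1, t, ..., t^(n-1),
the years are shifted by 1995 (our own choice, made only to keep the normal
equations reasonably well conditioned), and the system (1.7) is solved by a
plain Gaussian elimination with partial pivoting standing in for the methods of
Chapters 6 and 7.

```python
def solve_linear_system(M, rhs):
    """Gaussian elimination with partial pivoting, working on copies."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for k in range(n):
        # Bring the largest remaining entry of column k to the pivot position.
        pivot = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for i in range(k + 1, n):
            factor = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= factor * aug[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def polynomial_fit(ts, ys, n):
    """Coefficients a_0..a_{n-1} solving the normal equations (monomial basis)."""
    A = [[t ** j for j in range(n)] for t in ts]
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    Atb = [sum(row[i] * y for row, y in zip(A, ys)) for i in range(n)]
    return solve_linear_system(AtA, Atb)

def squared_error(ts, ys, coeffs):
    """The quantity E of (1.3) for the polynomial with the given coefficients."""
    return sum(
        (y - sum(a * t ** j for j, a in enumerate(coeffs))) ** 2
        for t, y in zip(ts, ys)
    )

# Data of Table 1.1, with the parameter shifted to t = year - 1995.
shifted = [float(year - 1995) for year in range(1990, 2001)]
index = [939, 972, 1006, 1022, 1016, 1011, 1038, 1047, 1058, 1071, 1083]

line = polynomial_fit(shifted, index, 2)      # n = 2: regression line
quartic = polynomial_fit(shifted, index, 5)   # n = 5: fourth-degree polynomial
print("E for n = 2:", squared_error(shifted, index, line))
print("E for n = 5:", squared_error(shifted, index, quartic))
```

Since the polynomials of degree at most 1 are contained in those of degree at
most 4, the second printed error cannot exceed the first, up to rounding.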


Fig. 1.3. Approximation of the data of Table 1.1 by a fourth-degree polynomial.

Multiple variables regression. So far we have assumed that the measured
physical quantity y depended on a single real parameter t. We now consider
the case in which the measured quantity depends on n parameters. We still
have m experiments and each of them yields a possibly
different measure. The goal is to deduce from these m measurements a very
simple experimental law that approximates the physical quantity as a linear
combination of the n parameters. Note that the form of this experimental
law is imposed (it is linear), whereas the coefficients of this linear form are
unknown.

In other words, let $a_i \in \mathbb{R}^n$ be a vector, the entries of which are the
values of the parameters for the ith experiment, and let $f(a)$ be the unknown
function from $\mathbb{R}^n$ into $\mathbb{R}$ that gives the measured quantity in terms of the
parameters. The linear regression problem consists in finding a vector $x \in \mathbb{R}^n$
(the entries of which are the coefficients of this experimental law or linear
regression) satisfying

\[
\sum_{i=1}^{m} |f(a_i) - \langle a_i, x \rangle_n|^2
= \min_{y \in \mathbb{R}^n} \sum_{i=1}^{m} |f(a_i) - \langle a_i, y \rangle_n|^2,
\]

where $\langle \cdot, \cdot \rangle_n$ denotes the scalar product in $\mathbb{R}^n$. Hence, we attempt to best
approximate the unknown function f by a linear form. Here “best” means “in
the least squares sense,” that is, we minimize the error in the Euclidean norm
of $\mathbb{R}^m$. Once again this problem is equivalent to solving the so-called normal
equations (1.7), where b is the vector of $\mathbb{R}^m$ whose entries are the $f(a_i)$ and
A is the matrix of $\mathcal{M}_{m,n}(\mathbb{R})$ whose m rows are the vectors $a_i$. The solution
$x \in \mathbb{R}^n$ of (1.7) gives the coefficients of the experimental law.
Least squares problems are discussed at length in Chapter 7. 
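
A short computation suggests why the normal equations (1.7) characterize the
minimizers of (1.6). It is only a sketch, the complete proof being deferred to
Chapter 7, and it assumes real data, so that $A^*$ is simply the transpose of A.
For any x and h in $\mathbb{R}^n$, expanding the square of the Euclidean norm gives

\[
\|A(x+h) - b\|^2 = \|Ax - b\|^2 + 2\,\langle A^*(Ax - b), h \rangle_n + \|Ah\|^2 .
\]

If $A^*Ax = A^*b$, the middle term vanishes and $\|A(x+h) - b\|^2 \ge \|Ax - b\|^2$ for
every h, so x is a minimizer. Conversely, if x is a minimizer, replacing h by sh
with a real number s tending to 0 shows that the middle term must vanish for
every h, which gives back (1.7).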


1.3 Vibrations of a Mechanical System 


The computation of the eigenvalues and eigenvectors of a matrix is a fun- 
damental mathematical tool for the study of the vibrations of mechanical 
structures. In this context, the eigenvalues are the squares of the frequencies, 
and the eigenvectors are the modes of vibration of the studied system. Con- 
sider, for instance, the computation of the vibration frequencies of a building, 
which is an important problem in order to determine its strength, for example, 
against earthquakes. To simplify the exposition we focus on a toy model, but 
the main ideas are the same for more complex and more realistic models. 
We consider a two-story building whose sufficiently rigid ceilings are assumed
to be point masses $m_1, m_2, m_3$. The walls are of negligible mass but
their elasticity is modeled, as that of a spring, by stiffness coefficients $k_1, k_2, k_3$
(the larger $k_i$ is, the more rigid or “stiff” the wall is). The horizontal
displacements of the ceilings are denoted by $y_1, y_2, y_3$, whereas the base of the
building is clamped on the ground; see Figure 1.4. In other words, this two-story
building is represented as a system of three masses linked by springs to a fixed
support; see Figure 1.5. We write the fundamental equation of mechanics,
which asserts that mass multiplied by acceleration is equal to the sum of the
applied forces. The only forces here are restoring forces exerted by the springs.
They are equal to the product of stiffness and elongation of the spring. The
displacements $y_1, y_2, y_3$ (with respect to the equilibrium) are functions of time
t. Their first derivatives, denoted by $\dot{y}_1, \dot{y}_2, \dot{y}_3$, are velocities, and their second
derivatives $\ddot{y}_1, \ddot{y}_2, \ddot{y}_3$ are accelerations of the masses $m_1, m_2, m_3$, respectively.

Fig. 1.4. A two-story building model.

Fig. 1.5. System of three masses linked by springs to a support.

Thus, we deduce the following three equations:

\[
\begin{cases}
m_1 \ddot{y}_1 + k_1 y_1 + k_2 (y_1 - y_2) = 0, \\
m_2 \ddot{y}_2 + k_2 (y_2 - y_1) + k_3 (y_2 - y_3) = 0, \\
m_3 \ddot{y}_3 + k_3 (y_3 - y_2) = 0,
\end{cases}
\tag{1.8}
\]

which read, in matrix form,

\[
M \ddot{y} + K y = 0, \tag{1.9}
\]

where

\[
y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}, \quad
M = \begin{pmatrix} m_1 & 0 & 0 \\ 0 & m_2 & 0 \\ 0 & 0 & m_3 \end{pmatrix}, \quad
K = \begin{pmatrix} k_1 + k_2 & -k_2 & 0 \\ -k_2 & k_2 + k_3 & -k_3 \\ 0 & -k_3 & k_3 \end{pmatrix}.
\]

The matrix M is called the mass matrix, while K is called the stiffness matrix. 
They are both symmetric. We look for particular solutions of equations (1.8) 
that are periodic (or harmonic) in time in order to represent the vibrations of 
the system. Accordingly, we set 


\[
y(t) = y^0 e^{i\omega t},
\]

where i is the imaginary unit and $\omega$ is the vibration frequency of
the solution. A simple computation shows that in this case, the acceleration
is $\ddot{y}(t) = -\omega^2 y(t)$, and that (1.9) simplifies to

\[
K y^0 = \omega^2 M y^0. \tag{1.10}
\]

If all masses are equal to 1, then M is the identity matrix, and (1.10) is a
standard eigenvalue problem for the matrix K, that is, $y^0$ is an eigenvector of
K corresponding to the eigenvalue $\omega^2$. If the masses take any values, (1.10)
is a “generalized” eigenvalue problem, that is, $\omega^2$ is an eigenvalue of the
matrix $M^{-1/2} K M^{-1/2}$ (on this topic, see Theorem 2.5.3 on the simultaneous
reduction of a scalar product and a quadratic form).

Of course, had we considered a building with n floors, we would have 
obtained a similar matrix problem of size n + 1, the matrix M being always 
diagonal, and K tridiagonal. In Chapter 10, several algorithms for the efficient 
computation of eigenvalues and eigenvectors will be discussed. 
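
To make the eigenvalue problem (1.10) concrete, here is a minimal Python
sketch for the three-mass model. It rests on assumptions of our own: the
function names are hypothetical, the masses and stiffness coefficients used in
the demonstration are arbitrary illustrative values, and the eigenvalues of the
symmetric matrix $M^{-1/2} K M^{-1/2}$ are computed by a basic cyclic Jacobi
rotation scheme, chosen here only because it is short (Chapter 10 discusses
eigenvalue algorithms properly).

```python
import math

def jacobi_eigen(S, tol=1e-24, max_sweeps=100):
    """Eigenvalues and eigenvectors of a real symmetric matrix (cyclic Jacobi).

    Returns (values, V); column j of V is a unit eigenvector for values[j].
    """
    n = len(S)
    A = [list(row) for row in S]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = sum(a * a for row in A for a in row)
    for _ in range(max_sweeps):
        off = sum(A[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off <= tol * total:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that annihilates the entry (p, q).
                if A[p][p] == A[q][q]:
                    theta = math.pi / 4.0
                else:
                    theta = 0.5 * math.atan(2.0 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # A <- A J (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):      # A <- J^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):      # V <- V J (accumulate eigenvectors)
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [A[i][i] for i in range(n)], V

def building_modes(masses, stiffness):
    """Return K, the eigenvalues omega^2, and the modes y0 of K y0 = omega^2 M y0."""
    k1, k2, k3 = stiffness
    K = [[k1 + k2, -k2, 0.0], [-k2, k2 + k3, -k3], [0.0, -k3, k3]]
    scale = [1.0 / math.sqrt(m) for m in masses]            # diagonal of M^{-1/2}
    S = [[scale[i] * K[i][j] * scale[j] for j in range(3)] for i in range(3)]
    values, W = jacobi_eigen(S)
    # An eigenvector w of S gives the mode y0 = M^{-1/2} w.
    modes = [[scale[i] * W[i][j] for i in range(3)] for j in range(3)]
    return K, values, modes

if __name__ == "__main__":
    K, values, modes = building_modes([2.0, 1.0, 3.0], [1.0, 2.0, 3.0])
    for lam, mode in sorted(zip(values, modes)):
        print("omega =", math.sqrt(lam), "mode =", mode)
```

Each printed pair is a vibration frequency together with the corresponding
mode, that is, the relative displacements of the three ceilings.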


1.4 The Vibrating String 


We generalize the example of Section 1.3 to the case of an infinite number of 
masses and springs. More precisely, we pass from a discrete model to a contin- 
uous one. We consider the vibrating string equation that is the generalization 
of (1.8) to an infinite number of masses; for more details, see [3]. This
equation is again a partial differential equation, very similar to that introduced in
Section 1.1. Upon discretization by finite differences, an approximate solution
of the vibrating string equation is obtained by solving an eigenvalue problem 
for a large matrix. 

The curvilinear abscissa along the string is denoted by x, and time is t.
The deflection of the string with respect to its horizontal equilibrium position
is therefore a real-valued function $u(t, x)$. We call $\ddot{u}(t, x)$ its second derivative
with respect to time t, and $u''(t, x)$ its second derivative with respect to x.
With no exterior forces, the vibrating string equation reads

\[
\begin{cases}
m\,\ddot{u}(t,x) - k\,u''(t,x) = 0 & \text{for all } x \in (0,1),\ t > 0, \\
u(t,0) = 0, \quad u(t,1) = 0,
\end{cases}
\tag{1.11}
\]

where m and k are the mass and stiffness per unit length of the string. The
boundary conditions “at both ends” $u(t,0) = u(t,1) = 0$ specify that at any
time t, the string is fixed at its endpoints; see Figure 1.6.

Fig. 1.6. Vibrating string problem.

We look for special “vibrating” solutions of (1.11) that are periodic in time
and of the form

\[
u(t,x) = v(x)\,e^{i\omega t},
\]

where $\omega$ is the vibration frequency of the string. A simple computation shows
that $v(x)$ is a solution to

\[
\begin{cases}
-v''(x) = \dfrac{m\omega^2}{k}\,v(x) & \text{for all } x \in (0,1), \\
v(0) = 0, \quad v(1) = 0.
\end{cases}
\tag{1.12}
\]

We say that $m\omega^2/k$ is an eigenvalue, and $v(x)$ is an eigenfunction of problem
(1.12). In the particular case studied here, solutions of (1.12) can be computed
explicitly; they are sine functions of period linked to the eigenvalue. However,
in higher space dimensions, or if the linear mass or the stiffness varies with
the point x, there is in general no explicit solution of this boundary value
problem, in which case solutions $(\omega, v(x))$ must be determined numerically.
As in Section 1.1, we compute approximate solutions by a finite difference
method. Let us recall that this method consists in dividing the interval [0,1]
into n subintervals of equal size 1/n, where n is an integer chosen according to
the desired accuracy (the larger n is, the closer the approximate solution will
be to the exact solution). We denote by $x_i = i/n$, $0 \le i \le n$, the n + 1
endpoints of the subintervals. We call $v_i$ the approximate value of the solution v(x)
at point $x_i$, and $\lambda$ an approximation of $m\omega^2/k$. The idea is to write equation
(1.12) at each point $x_i$, and substitute the second derivative by an appropriate
linear combination of the unknowns $v_i$ using Taylor's formula:

\[ -v''(x_i) = \frac{2v(x_i) - v(x_{i-1}) - v(x_{i+1})}{n^{-2}} + O(n^{-2}). \]



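Hence, we obtain a system of (n - 1) equations

\[ \frac{2v_i - v_{i-1} - v_{i+1}}{n^{-2}} = \lambda v_i, \qquad 1 \le i \le n-1, \]

supplemented by the two boundary conditions $v_0 = v_n = 0$. We can rewrite
the system in matrix form:

\[ A_n v = \lambda v, \tag{1.13} \]

where v is the vector whose entries are $(v_1,\ldots,v_{n-1})$, and

\[
A_n = n^2
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 \\
-1 & 2 & \ddots & \ddots & \vdots \\
0 & \ddots & \ddots & -1 & 0 \\
\vdots & \ddots & -1 & 2 & -1 \\
0 & \cdots & 0 & -1 & 2
\end{pmatrix}.
\]
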


In other words, $(\lambda, v)$ is an eigenvalue-eigenvector pair of the tridiagonal
symmetric matrix $A_n$. Since $A_n$ is real symmetric, it is diagonalizable (see
Theorem 2.5.2), so (1.13) admits n - 1 linearly independent solutions. Therefore,
it is possible to approximately compute the vibrating motion of a string
by solving a matrix eigenvalue problem. In the case at hand, we can compute
explicitly the eigenvalues and eigenvectors of matrix $A_n$; see Exercise 5.16.
More generally, one has to resort to numerical algorithms for approximating 
eigenvalues and eigenvectors of a matrix; see Chapter 10. 
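
As an illustration, the following minimal Python sketch builds the matrix $A_n$ of (1.13) and checks numerically a closed-form eigenpair. The closed form used here, $\lambda_k = 4n^2\sin^2(k\pi/(2n))$ with $v_i = \sin(k\pi i/n)$, is a classical result for this tridiagonal matrix that we assume rather than derive; the function names are ours and purely illustrative.

```python
import math

def string_matrix(n):
    """Return the (n-1) x (n-1) matrix A_n = n^2 * tridiag(-1, 2, -1)."""
    size = n - 1
    a = [[0.0] * size for _ in range(size)]
    for i in range(size):
        a[i][i] = 2.0 * n * n
        if i > 0:
            a[i][i - 1] = -1.0 * n * n
        if i < size - 1:
            a[i][i + 1] = -1.0 * n * n
    return a

def matvec(a, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def discrete_mode(n, k):
    """Assumed closed-form eigenpair (lambda_k, v) of A_n, for 1 <= k <= n-1."""
    lam = 4.0 * n * n * math.sin(k * math.pi / (2 * n)) ** 2
    v = [math.sin(k * math.pi * i / n) for i in range(1, n)]
    return lam, v

def residual(n, k):
    """Largest entry of |A_n v - lambda v| for the k-th assumed eigenpair."""
    lam, v = discrete_mode(n, k)
    av = matvec(string_matrix(n), v)
    return max(abs(x - lam * y) for x, y in zip(av, v))

print(residual(20, 1), residual(20, 3))
print(discrete_mode(200, 1)[0], math.pi ** 2)
```

The residuals should be at the level of rounding errors, and for large n the smallest discrete eigenvalue approaches $\pi^2$, the smallest eigenvalue of problem (1.12).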


1.5 Image Compression by the SVD Factorization 


A black-and-white image can be identified with a rectangular matrix A the
size of which is equal to the number of pixels of the image and with entries $a_{i,j}$
belonging to the range [0,1], where 0 corresponds to a white pixel and 1 to a
black pixel. Intermediate values $0 < a_{i,j} < 1$ correspond to different levels of
gray. We assume that the size of the image is very large, so it cannot reasonably
be stored on a computer (not enough disk space) or sent by email (network
saturation risk). Let us show how the SVD (singular value decomposition) is
useful for the compression of images, i.e., for minimizing the storage size of
an image by replacing it by an approximation that is visually equivalent.

As we shall see in Section 2.7, the SVD factorization of a matrix
$A \in M_{m,n}(\mathbb{C})$ of rank r is

\[ A = V\tilde{\Sigma}U^* \quad\text{and}\quad \tilde{\Sigma} = \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix}, \]

where $U \in M_n(\mathbb{C})$ and $V \in M_m(\mathbb{C})$ are two unitary matrices and $\Sigma$ is the
diagonal matrix equal to $\mathrm{diag}(\mu_1,\ldots,\mu_r)$, where $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > 0$
are the positive square roots of the nonzero eigenvalues of $A^*A$ (where $A^*$ denotes the
adjoint matrix of A), called the “singular values” of A. Therefore, computing
the SVD factorization is a type of eigenvalue problem. Denoting by $u_i$ and $v_i$
the columns of U and V, the SVD factorization of A can also be written

\[ A = V\tilde{\Sigma}U^* = \sum_{i=1}^{r} \mu_i v_i u_i^*. \tag{1.14} \]



Since the singular values $\mu_1 \ge \cdots \ge \mu_r > 0$ are arranged in decreasing order,
an approximation of A can easily be obtained by keeping only the first $k \le r$
terms in (1.14),

\[ A_k = \sum_{i=1}^{k} \mu_i v_i u_i^*. \]

Actually, Proposition 3.2.1 will prove that $A_k$ is the best approximation (in
some sense) of A among matrices of rank k. Of course, if k is much smaller
than r (which is at most min(m, n)), approximating A by $A_k$ yields a big
saving in terms of memory requirement.

Indeed, the storage of the whole matrix $A \in M_{m,n}(\mathbb{R})$ requires a priori $m \times n$
scalars. To store the approximation $A_k$, it suffices, after having performed
the SVD factorization of A, to store k vectors $\mu_i v_i \in \mathbb{C}^m$ and k vectors
$u_i \in \mathbb{C}^n$, i.e., a total of k(m + n) scalars. This is worthwhile if k is small
and if we are satisfied with such an approximation of A. In Figure 1.7, the
original image is a grid of 500 x 752 pixels; the corresponding matrix A is
thus of size 500 x 752. We display the original image as well as three images
corresponding to three approximations $A_k$ of A. For k = 10, the image is very
blurred, but for k = 20, the subject is recognizable. There does not seem to
be any difference between the image obtained with k = 60 and the original
image, even though the storage space is divided by 5:


\[ \frac{k(m+n)}{mn} = \frac{60\,(500 + 752)}{500 \times 752} \approx 0.2. \]
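
The storage count above is easy to reproduce. The short Python sketch below (the function name `storage_ratio` is ours) evaluates the ratio k(m + n)/(mn) for the three values of k used in Figure 1.7; it only does the arithmetic and does not compute any SVD.

```python
def storage_ratio(m, n, k):
    """Scalars needed for A_k, namely k(m + n), divided by the m*n scalars of A."""
    return k * (m + n) / (m * n)

# Image of Figure 1.7: m = 500 rows, n = 752 columns.
for k in (10, 20, 60):
    print(k, round(storage_ratio(500, 752, k), 3))
```

For k = 60 the ratio is close to 0.2, which is the factor of 5 quoted in the text.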


The main computational cost of this method of image compression is the SVD
factorization of the matrix A. Recall that the singular values of A are the
positive square roots of the eigenvalues of $A^*A$. Thus, the SVD factorization
is a variant of the problem of determining the eigenvalues and eigenvectors of
a matrix, which we shall study in Chapter 10.


Fig. 1.7. An application of the SVD factorization: image compression.


Finally, let us mention that there exist other algorithms that are more 
efficient and cheaper than the SVD factorization for image processing. Their 
analysis is beyond the scope of this course. 


2 


Definition and Properties of Matrices 


Throughout this book we consider matrices with real or complex entries. Most 
of the results are valid for real and complex matrices (but not all of them!). 
Thus, in order to avoid tedious repetition, we denote by $\mathbb{K}$ a field that is
either the field of all real numbers $\mathbb{R}$ or the field of all complex numbers $\mathbb{C}$,
i.e., $\mathbb{K} = \mathbb{R}$ or $\mathbb{C}$.

The goal of this chapter is to recall basic results and definitions that are
useful in the sequel. Therefore, many statements are given without proofs. We
refer to classical courses on linear algebra for further details (see, e.g., [10],
[16]).


2.1 Gram-Schmidt Orthonormalization Process 


We consider the vector space $\mathbb{K}^d$ with the scalar product $\langle x,y\rangle = \sum_{i=1}^{d} x_i y_i$
if $\mathbb{K} = \mathbb{R}$, or the Hermitian product $\langle x,y\rangle = \sum_{i=1}^{d} x_i \overline{y_i}$ if $\mathbb{K} = \mathbb{C}$. We describe
a constructive process for building an orthonormal family out of a family of
linearly independent vectors in $\mathbb{K}^d$, known as the Gram-Schmidt orthonormalization
process. This algorithm is often used in numerical linear algebra.
In the sequel, the notation span {...} is used for the subspace spanned by the
vectors between braces.


Theorem 2.1.1 (Gram-Schmidt). Let $(x_1,\ldots,x_n)$ be a linearly independent
family in $\mathbb{K}^d$. There exists an orthonormal family $(y_1,\ldots,y_n)$ such that

\[ \mathrm{span}\{y_1,\ldots,y_p\} = \mathrm{span}\{x_1,\ldots,x_p\}, \quad \text{for any index } p \text{ in the range } 1 \le p \le n. \]

If $\mathbb{K} = \mathbb{R}$, this family is unique up to a change of sign of each vector $y_p$. If
$\mathbb{K} = \mathbb{C}$, this family is unique up to a multiplicative factor of unit modulus for
each vector $y_p$.


Fig. 2.1. Gram-Schmidt orthonormalization: $x_1$ and $x_2$ are linearly independent
vectors, $y_1$ and $y_2$ are orthonormal.


Proof. We proceed by induction on n. For n = 1, we define

\[ y_1 = \frac{x_1}{\|x_1\|}, \]

which is the unique vector (up to a change of sign in the real case, and to
multiplication by a complex number of unit modulus in the complex case) to
satisfy the desired property. Assume that the result holds up to order n - 1.
Let $(y_1,\ldots,y_{n-1})$ be the unique orthonormal family such that

\[ \mathrm{span}\{y_1,\ldots,y_p\} = \mathrm{span}\{x_1,\ldots,x_p\}, \]

for each index p in the range $1 \le p \le n-1$. If $y_n$ together with the previous
$(y_1,\ldots,y_{n-1})$ satisfy the recurrence property, then, since $\mathrm{span}\{y_1,\ldots,y_n\} =
\mathrm{span}\{x_1,\ldots,x_n\}$, we necessarily have

\[ y_n = \sum_{i=1}^{n} \alpha_i x_i, \]


where the coefficients $\alpha_i$ belong to $\mathbb{K}$. By assumption at order n - 1, we
have $\mathrm{span}\{y_1,\ldots,y_{n-1}\} = \mathrm{span}\{x_1,\ldots,x_{n-1}\}$, so the linear combination of
$(x_1,\ldots,x_{n-1})$ can be replaced by that of $(y_1,\ldots,y_{n-1})$ to yield

\[ y_n = \alpha_n x_n + \sum_{i=1}^{n-1} \beta_i y_i \]

with some other coefficients $\beta_i$. Since $y_n$ must be orthogonal to all previous
$y_i$ for $1 \le i \le n-1$, we deduce

\[ \langle y_n, y_i\rangle = 0 = \alpha_n \langle x_n, y_i\rangle + \beta_i, \]


which implies

\[ y_n = \alpha_n \tilde{y}_n \quad\text{with}\quad \tilde{y}_n = x_n - \sum_{i=1}^{n-1} \langle x_n, y_i\rangle\, y_i. \]

Since the family $(x_1,\ldots,x_n)$ is linearly independent and $\mathrm{span}\{y_1,\ldots,y_{n-1}\} =
\mathrm{span}\{x_1,\ldots,x_{n-1}\}$, $\tilde{y}_n$ cannot vanish; see Figure 2.1. Now, $y_n$ must have
unit norm, which yields

\[ |\alpha_n| = \frac{1}{\|\tilde{y}_n\|}. \]

In the real case, we deduce $\alpha_n = \pm 1/\|\tilde{y}_n\|$. In the complex case, $\alpha_n$ is equal to
$1/\|\tilde{y}_n\|$ up to a multiplicative factor that is a complex number of modulus 1.
Such a choice of $y_n$ clearly satisfies the recurrence property, which concludes
the proof.


Remark 2.1.1. If the family $(x_1,\ldots,x_n)$ is not linearly independent, and
rather generates a linear subspace of dimension r < n, then the Gram-Schmidt
orthonormalization process yields an orthonormal family $(y_1,\ldots,y_r)$ of only
r vectors.
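
The construction used in the proof translates directly into an algorithm. The following Python sketch is one possible implementation of the classical Gram-Schmidt process, with the normalization choice $\alpha_n = 1/\|\tilde{y}_n\|$; the tolerance `tol` used to discard (numerically) dependent vectors, as in Remark 2.1.1, is our own assumption and not part of the theorem.

```python
import math

def inner(x, y):
    """Hermitian product <x, y> = sum_i x_i * conj(y_i) (scalar product if real)."""
    return sum(xi * complex(yi).conjugate() for xi, yi in zip(x, y))

def gram_schmidt(vectors, tol=1e-12):
    """Return an orthonormal family spanning the same nested subspaces."""
    basis = []
    for x in vectors:
        y = [complex(c) for c in x]
        for q in basis:
            coeff = inner(x, q)  # <x_n, y_i>
            y = [yc - coeff * qc for yc, qc in zip(y, q)]
        norm = math.sqrt(inner(y, y).real)
        if norm > tol:  # skip vectors already in the span (Remark 2.1.1)
            basis.append([c / norm for c in y])
    return basis

ys = gram_schmidt([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
for y in ys:
    print([round(c.real, 4) for c in y])
```

In floating-point arithmetic this classical variant can lose orthogonality for ill-conditioned families, so the sketch is meant only to mirror the proof.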


2.2 Matrices 
Definitions 


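Definition 2.2.1. A matrix A is a rectangular array $(a_{i,j})_{1\le i\le n,\ 1\le j\le p}$, where
$a_{i,j} \in \mathbb{K}$ is the entry in row i and column j, i.e.,

\[
A = \begin{pmatrix}
a_{1,1} & \cdots & a_{1,p} \\
\vdots & & \vdots \\
a_{n,1} & \cdots & a_{n,p}
\end{pmatrix}.
\]

The set of all matrices of size n x p (n rows and p columns) is denoted by
$M_{n,p}(\mathbb{K})$.
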
Definition 2.2.2. Let A and B be two matrices in $M_{n,p}(\mathbb{K})$ defined by

\[ A = (a_{i,j})_{1\le i\le n,\ 1\le j\le p} \quad\text{and}\quad B = (b_{i,j})_{1\le i\le n,\ 1\le j\le p}. \]

The sum A + B is the matrix in $M_{n,p}(\mathbb{K})$ defined by

\[ A + B = (a_{i,j} + b_{i,j})_{1\le i\le n,\ 1\le j\le p}. \]

Let $\lambda \in \mathbb{K}$. The scalar multiplication of A by $\lambda$ is defined by

\[ \lambda A = (\lambda a_{i,j})_{1\le i\le n,\ 1\le j\le p}. \]


Definition 2.2.3. The product of two matrices $A \in M_{n,p}(\mathbb{K})$ and
$B \in M_{p,q}(\mathbb{K})$, defined by

\[ A = (a_{i,j})_{1\le i\le n,\ 1\le j\le p} \quad\text{and}\quad B = (b_{i,j})_{1\le i\le p,\ 1\le j\le q}, \]

is a matrix C = AB in $M_{n,q}(\mathbb{K})$ defined by

\[ C = \Bigl( c_{i,j} = \sum_{k=1}^{p} a_{i,k}\, b_{k,j} \Bigr)_{1\le i\le n,\ 1\le j\le q}. \]


Remark 2.2.1. The number of columns of A and that of rows of B must be
equal in order to define their product AB. Apart from this constraint, A and B
need not have the same dimensions, nor belong to the same matrix space. It is
an easy exercise to check that the matrix multiplication is associative, i.e.,
(MN)P = M(NP). However, even for n = p = q, the multiplication is usually
not commutative, i.e., $AB \ne BA$.
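
To make Definition 2.2.3 and Remark 2.2.1 concrete, here is a small Python sketch of the product formula with matrices stored as lists of rows. The helper name `matmul` and the sample matrices are ours, chosen only for illustration.

```python
def matmul(a, b):
    """Product C = AB with c[i][j] = sum_k a[i][k] * b[k][j]."""
    n, p, q = len(a), len(b), len(b[0])
    if any(len(row) != p for row in a):
        raise ValueError("number of columns of A must equal number of rows of B")
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(q)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[1, 1], [0, 1]]

print(matmul(A, B))  # AB
print(matmul(B, A))  # BA, which differs from AB
print(matmul(matmul(A, B), C) == matmul(A, matmul(B, C)))  # associativity
```

Here AB exchanges the columns of A whereas BA exchanges its rows, which is a simple instance of non-commutativity.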


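Definition 2.2.4. For any matrix $A = (a_{i,j})_{1\le i\le n,\ 1\le j\le p} \in M_{n,p}(\mathbb{K})$, its
transpose matrix $A^t \in M_{p,n}(\mathbb{K})$ is defined by

\[ A^t = (a_{j,i})_{1\le j\le p,\ 1\le i\le n}. \]

In other words, the rows of $A^t$ are the columns of A, and the columns of $A^t$
are the rows of A:

\[
A = \begin{pmatrix}
a_{1,1} & \cdots & a_{1,p} \\
\vdots & & \vdots \\
a_{n,1} & \cdots & a_{n,p}
\end{pmatrix},
\qquad
A^t = \begin{pmatrix}
a_{1,1} & \cdots & a_{n,1} \\
\vdots & & \vdots \\
a_{1,p} & \cdots & a_{n,p}
\end{pmatrix}.
\]

If $A^t = A$ (which can happen only if A is a square matrix, i.e., if n = p), then
A is said to be symmetric.


The notation $A^t$ for the transpose matrix of A is not universal. Some authors
prefer to denote it by $A^T$, or put the exponent t before the matrix, as in ${}^tA$.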


When the number of rows is equal to the number of columns, the matrix is
said to be a square matrix, the set of which is denoted by $M_n(\mathbb{K}) = M_{n,n}(\mathbb{K})$,
where n is the size of the matrix. The set $M_n(\mathbb{K})$ is thus a noncommutative
algebra for the multiplication. Its neutral element is the identity matrix, denoted
by I (or $I_n$ if one wants to give a precise indication of the dimension)
and defined by its entries $(\delta_{i,j})_{1\le i,j\le n}$, where $\delta_{i,j}$ is the Kronecker symbol
taking the values $\delta_{i,i} = 1$ and $\delta_{i,j} = 0$ if $i \ne j$.


Definition 2.2.5. A matrix $A \in M_n(\mathbb{K})$ is said to be invertible (or nonsingular)
if there exists a matrix $B \in M_n(\mathbb{K})$ such that $AB = BA = I_n$. This
matrix B is denoted by $A^{-1}$ and is called the inverse matrix of A.

A noninvertible matrix is said to be singular. The kernel, or null space, of a
matrix $A \in M_{n,p}(\mathbb{K})$ is the set of vectors $x \in \mathbb{K}^p$ such that Ax = 0; it is
denoted by Ker A. The image, or range, of A is the set of vectors $y \in \mathbb{K}^n$ such
that y = Ax, with $x \in \mathbb{K}^p$; it is denoted by Im A. The dimension of the linear
space Im A is called the rank of A; it is denoted by rk A.


Lemma 2.2.1. For any $A \in M_n(\mathbb{K})$ the following statements are equivalent:


1. A is invertible;

2. Ker A = {0};

3. Im A = $\mathbb{K}^n$;

4. there exists $B \in M_n(\mathbb{K})$ such that $AB = I_n$;

5. there exists $B \in M_n(\mathbb{K})$ such that $BA = I_n$.


In the last two cases, the matrix B is precisely equal to the inverse $A^{-1}$.

Lemma 2.2.2. Let A and B be two invertible matrices in $M_n(\mathbb{K})$. Then

\[ (AB)^{-1} = B^{-1}A^{-1}. \]


2.2.1 Trace and Determinant 
In this section we consider only square matrices in Mn(K). 


Definition 2.2.6. The trace of a matrix $A = (a_{i,j})_{1\le i,j\le n}$ is the sum of its
diagonal elements

\[ \mathrm{tr}\,A = \sum_{i=1}^{n} a_{i,i}. \]


Lemma 2.2.3. If A and B are two matrices in $M_n(\mathbb{K})$, then

\[ \mathrm{tr}\,(AB) = \mathrm{tr}\,(BA). \]


Definition 2.2.7. A permutation of order n is a one-to-one mapping from
the set {1, 2, ..., n} into itself. We denote by $S_n$ the set of all permutations
of order n. The signature of a permutation $\sigma$ is the number $\varepsilon(\sigma)$, equal to +1
or -1, defined by

\[ \varepsilon(\sigma) = (-1)^{p(\sigma)} \quad\text{with}\quad p(\sigma) = \sum_{1\le i<j\le n} \mathrm{Inv}_\sigma(i,j), \]

where the number $\mathrm{Inv}_\sigma(i,j)$ indicates whether the order between i and j is
inverted or not by the permutation $\sigma$, and is defined, for i < j, by

\[
\mathrm{Inv}_\sigma(i,j) =
\begin{cases}
0 & \text{if } \sigma(i) < \sigma(j), \\
1 & \text{if } \sigma(i) > \sigma(j).
\end{cases}
\]


Definition 2.2.8. The determinant of a square matrix $A = (a_{i,j})_{1\le i,j\le n} \in M_n(\mathbb{K})$ is

\[ \det A = \sum_{\sigma\in S_n} \varepsilon(\sigma) \prod_{i=1}^{n} a_{i,\sigma(i)}. \]


Lemma 2.2.4. Let A and B be two square matrices in $M_n(\mathbb{K})$. Then


1. det (AB) = (det A)(det B) = det (BA);
2. det ($A^t$) = det (A);
3. A is invertible if and only if $\det A \ne 0$.
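
Definitions 2.2.7 and 2.2.8 can be transcribed almost literally. The sketch below, with function names of our choosing and 0-based indices instead of the 1-based indices of the text, counts inversions to obtain the signature and sums over all n! permutations. Its cost grows like n!, so it is an illustration of the definition and not a practical way to compute determinants.

```python
from itertools import permutations

def signature(sigma):
    """Signature (+1 or -1) from the number of inverted pairs i < j."""
    inversions = sum(1 for i in range(len(sigma))
                     for j in range(i + 1, len(sigma))
                     if sigma[i] > sigma[j])
    return -1 if inversions % 2 else 1

def det(a):
    """Determinant as the sum over permutations of signed products."""
    n = len(a)
    total = 0
    for sigma in permutations(range(n)):
        term = signature(sigma)
        for i in range(n):
            term *= a[i][sigma[i]]
        total += term
    return total

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

A = [[2, 0, 1], [1, 3, 2], [1, 1, 4]]
B = [[1, 2, 0], [0, 1, 0], [3, 0, 1]]
print(det(A), det(B), det(matmul(A, B)))
```

With integer entries the arithmetic is exact, so the identities of Lemma 2.2.4 can be checked without rounding issues.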


2.2.2 Special Matrices 


Definition 2.2.9. A matrix $A = (a_{i,j})_{1\le i,j\le n} \in M_n(\mathbb{K})$ is said to be diagonal
if its entries satisfy $a_{i,j} = 0$ for $i \ne j$. A diagonal matrix is often denoted by
$A = \mathrm{diag}(a_{1,1},\ldots,a_{n,n})$.


Definition 2.2.10. Let $T = (t_{i,j})_{1\le i\le n,\ 1\le j\le p}$ be a matrix in $M_{n,p}(\mathbb{K})$. It is
said to be an upper triangular matrix if $t_{i,j} = 0$ for all indices (i, j) such that
i > j. It is said to be a lower triangular matrix if $t_{i,j} = 0$ for all indices (i, j)
such that i < j.


Lemma 2.2.5. Let T be a lower triangular matrix (respectively, upper triangular)
in $M_n(\mathbb{K})$. Its inverse (when it exists) is also a lower triangular
matrix (respectively, upper triangular) with diagonal entries equal to the inverses
of the diagonal entries of T. Let T' be another lower triangular matrix
(respectively, upper triangular) in $M_n(\mathbb{K})$. The product TT' is also a lower
triangular matrix (respectively, upper triangular) with diagonal entries equal
to the products of the diagonal entries of T and T'.


Definition 2.2.11. Let $A = (a_{i,j})_{1\le i,j\le n}$ be a complex square matrix in
$M_n(\mathbb{C})$. The matrix $A^* \in M_n(\mathbb{C})$, defined by $A^* = \overline{A}^{\,t} = (\overline{a_{j,i}})_{1\le i,j\le n}$, is
the adjoint matrix of A.


Definition 2.2.12. Let A be a complex square matrix in $M_n(\mathbb{C})$.
1. A is self-adjoint or Hermitian if $A = A^*$;
2. A is unitary if $A^{-1} = A^*$;
3. A is normal if $AA^* = A^*A$.


Definition 2.2.13. Let A be a real square matrix in $M_n(\mathbb{R})$.


1. A is symmetric or self-adjoint if $A = A^t$ (or equivalently $A = A^*$);
2. A is orthogonal or unitary if $A^{-1} = A^t$ (or equivalently $A^{-1} = A^*$);
3. A is normal if $AA^t = A^tA$ (or equivalently $AA^* = A^*A$).



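2.2.3 Rows and Columns

A matrix $A \in M_{n,p}(\mathbb{C})$ may be defined by its columns $c_j \in \mathbb{C}^n$ as

\[ A = [c_1 | \ldots | c_p] \]

or by its rows $\ell_i \in M_{1,p}(\mathbb{C})$ as

\[ A = \begin{bmatrix} \ell_1 \\ \vdots \\ \ell_n \end{bmatrix}. \]

Recalling that $c_j^* \in M_{1,n}(\mathbb{C})$ (the adjoint of $c_j$) is a row vector and $\ell_i^* \in \mathbb{C}^p$
(the adjoint of $\ell_i$) is a column vector, we have

\[ A^* = \begin{bmatrix} c_1^* \\ \vdots \\ c_p^* \end{bmatrix} = [\ell_1^* | \ldots | \ell_n^*], \]

and for any $x \in \mathbb{C}^p$,

\[ Ax = \begin{bmatrix} \ell_1 x \\ \vdots \\ \ell_n x \end{bmatrix} = \sum_{i=1}^{p} x_i c_i. \]

For $X \in M_{m,n}(\mathbb{C})$, we have

\[ XA = X[c_1 | \ldots | c_p] = [Xc_1 | \ldots | Xc_p]. \]

Similarly for $X \in M_{p,m}(\mathbb{C})$, we have

\[ AX = \begin{bmatrix} \ell_1 \\ \vdots \\ \ell_n \end{bmatrix} X = \begin{bmatrix} \ell_1 X \\ \vdots \\ \ell_n X \end{bmatrix}. \]

By the same token, given $u_1,\ldots,u_m$ vectors in $\mathbb{C}^n$, and $v_1,\ldots,v_m$ vectors
in $\mathbb{C}^p$, one can define a product matrix in $M_{n,p}(\mathbb{C})$ by

\[ \sum_{i=1}^{m} u_i v_i^* = [u_1 | \ldots | u_m] \begin{bmatrix} v_1^* \\ \vdots \\ v_m^* \end{bmatrix}. \]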


2.2.4 Row and Column Permutation 


Let A be a matrix in $M_{n,p}(\mathbb{K})$. We interpret some usual operations on A as
multiplying A by some other matrices.


• Multiplying each row i of A by a scalar $\alpha_i \in \mathbb{K}$ is done by left-multiplying
  the matrix A by a diagonal matrix, $\mathrm{diag}(\alpha_1,\ldots,\alpha_n)A$.

• Multiplying each column j of A by $\beta_j \in \mathbb{K}$ is done by right-multiplying
  the matrix A by a diagonal matrix, $A\,\mathrm{diag}(\beta_1,\ldots,\beta_p)$.

• To exchange rows $l_1$ and $l_2 \ne l_1$, we multiply A on the left by an elementary
  permutation matrix $P(l_1,l_2)$, which is a square matrix of size n
  defined by its entries:

  \[ p_{i,i} = \begin{cases} 0 & \text{if } i \in \{l_1, l_2\}, \\ 1 & \text{otherwise;} \end{cases} \]

  and for $i \ne j$,

  \[ p_{i,j} = \begin{cases} 1 & \text{if } (i,j) \in \{(l_1,l_2), (l_2,l_1)\}, \\ 0 & \text{otherwise.} \end{cases} \]

  The matrix $P(l_1,l_2)$ is nothing but the identity matrix whose rows $l_1$ and
  $l_2$ are permuted. The resulting matrix $P(l_1,l_2)A$ has exchanged rows $l_1$
  and $l_2$ of the initial matrix A.

• To exchange columns $c_1$ and $c_2$, we multiply A on the right by an elementary
  permutation matrix $P(c_1,c_2)$ of size p.

• A general permutation matrix is any matrix obtained from the identity
  matrix by permuting its rows (not necessarily only two). Such a permutation
  matrix is actually a product of elementary permutation matrices,
  and its inverse is just its transpose. Therefore its determinant is equal to
  $\pm 1$.
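
The operations listed above can be checked on a small example. In the following Python sketch the helper names are ours, and indices are 0-based (so rows "0 and 2" correspond to rows 1 and 3 in the notation of the text).

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def elementary_permutation(n, l1, l2):
    """Identity matrix of size n with rows l1 and l2 exchanged (0-based)."""
    p = identity(n)
    p[l1], p[l2] = p[l2], p[l1]
    return p

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0 for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4], [5, 6]]          # n = 3 rows, p = 2 columns
P = elementary_permutation(3, 0, 2)

print(matmul(P, A))                    # rows 0 and 2 of A exchanged
print(matmul(diag([2, 1, 1]), A))      # row 0 of A multiplied by 2
print(matmul(A, diag([1, 10])))        # column 1 of A multiplied by 10
print(matmul(P, transpose(P)) == identity(3))  # inverse equals transpose
```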


2.2.5 Block Matrices 


So far, we have discussed matrices with entries belonging to a field $\mathbb{K}$ equal to
$\mathbb{R}$ or $\mathbb{C}$. Actually, one can define matrices with entries in a noncommutative
ring and still keep the same matrix addition and multiplication as introduced
in Definitions 2.2.2 and 2.2.3. This is of particular interest for the so-called
“block matrices” that we now describe.

Let A be a square matrix in $M_n(\mathbb{C})$. Let $(n_I)_{1\le I\le p}$ be a family of positive
integers such that $\sum_{I=1}^{p} n_I = n$. Let $(e_1,\ldots,e_n)$ be the canonical basis of
$\mathbb{C}^n$. We call $V_1$ the subspace of $\mathbb{C}^n$ spanned by the first $n_1$ basis vectors, $V_2$
the subspace of $\mathbb{C}^n$ spanned by the next $n_2$ basis vectors, and so on up to $V_p$
spanned by the last $n_p$ basis vectors. The dimension of each $V_I$ is $n_I$. Let $A_{I,J}$
be the submatrix (or block) of size $n_I \times n_J$ defined as the restriction of A on
the domain space $V_J$ into the target space $V_I$. We write

\[
A = \begin{pmatrix}
A_{1,1} & \cdots & A_{1,p} \\
\vdots & \ddots & \vdots \\
A_{p,1} & \cdots & A_{p,p}
\end{pmatrix}.
\]

In other words, A has been partitioned into p horizontal and vertical strips
of unequal sizes $(n_I)_{1\le I\le p}$. The diagonal blocks $A_{I,I}$ are square matrices, but
usually not the off-diagonal blocks $A_{I,J}$, $I \ne J$.


Lemma 2.2.6. Let $A = (A_{I,J})_{1\le I,J\le p}$ and $B = (B_{I,J})_{1\le I,J\le p}$ be two block
matrices with entries $A_{I,J}$ and $B_{I,J}$ belonging to $M_{n_I,n_J}(\mathbb{K})$. Let C = AB be
the usual matrix product of A and B. Then C can also be written as a block
matrix with entries $C = (C_{I,J})_{1\le I,J\le p}$ given by the following block multiplication
rule:

\[ C_{I,J} = \sum_{K=1}^{p} A_{I,K} B_{K,J}, \quad \text{for all } 1 \le I,J \le p, \]

where $C_{I,J}$ has the same size as $A_{I,J}$ and $B_{I,J}$.
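
The block multiplication rule of Lemma 2.2.6 can be verified numerically. The sketch below (helper names and the sample partition are ours) extracts the blocks of two matrices sharing the same partition, forms each $C_{I,J}$ by the rule, and compares the assembled result with the ordinary product.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(a, sizes, I, J):
    """Block A_{I,J} (0-based) for the partition given by `sizes`."""
    r0, c0 = sum(sizes[:I]), sum(sizes[:J])
    return [row[c0:c0 + sizes[J]] for row in a[r0:r0 + sizes[I]]]

def block_product(a, b, sizes):
    """Assemble C from C_{I,J} = sum_K A_{I,K} B_{K,J}."""
    p, n = len(sizes), sum(sizes)
    c = [[0] * n for _ in range(n)]
    for I in range(p):
        for J in range(p):
            acc = [[0] * sizes[J] for _ in range(sizes[I])]
            for K in range(p):
                acc = matadd(acc, matmul(block(a, sizes, I, K),
                                         block(b, sizes, K, J)))
            r0, c0 = sum(sizes[:I]), sum(sizes[:J])
            for i in range(sizes[I]):
                for j in range(sizes[J]):
                    c[r0 + i][c0 + j] = acc[i][j]
    return c

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 2], [0, 1, 1], [3, 1, 0]]
print(block_product(A, B, [1, 2]) == matmul(A, B))
```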


It is essential that A and B share the same block partitioning, i.e., $A_{I,J}$ and
$B_{I,J}$ have equal sizes, in order to correctly define the product $A_{I,K}$ times
$B_{K,J}$ in the above lemma. One must also keep in mind that the matrix
multiplication is not commutative, so the order of multiplication in the above block
multiplication rule is important. Although block matrices are very handy (and
we shall use them frequently in the sequel), not all matrix operations can be
generalized to block matrices. In particular, there is no block determinant
rule.


2.3 Spectral Theory of Matrices


In the sequel and unless mentioned otherwise, we assume that all matrices are
square and complex (which encompasses the real case as well).


Definition 2.3.1. Let $A \in M_n(\mathbb{C})$. The characteristic polynomial of A is the
polynomial $P_A(\lambda)$ defined on $\mathbb{C}$ by

\[ P_A(\lambda) = \det(A - \lambda I). \]

It is a polynomial of degree equal to n. It has thus n roots in $\mathbb{C}$ (for a proof,
see for instance [10]), which we call the eigenvalues of A. The algebraic
multiplicity of an eigenvalue is its multiplicity as a root of $P_A(\lambda)$. An eigenvalue
whose algebraic multiplicity is equal to one is said to be a simple eigenvalue;
otherwise, it is called a multiple eigenvalue.

We call a nonzero vector $x \in \mathbb{C}^n$ such that $Ax = \lambda x$ an eigenvector of A
associated with the eigenvalue $\lambda$.


We shall sometimes denote by $\lambda(A)$ an eigenvalue of A. The set of eigenvalues
of a matrix A is called the spectrum of A and is denoted by $\sigma(A)$.


Definition 2.3.2. We call the maximum of the moduli of the eigenvalues of
a matrix $A \in M_n(\mathbb{C})$ the spectral radius of A, and we denote it by $\varrho(A)$.


Let us make some remarks. 


1. If $\lambda$ is an eigenvalue of A, then there always exists a corresponding eigenvector
   x, namely a vector $x \ne 0$ in $\mathbb{C}^n$ such that $Ax = \lambda x$ (x is not
   unique). Indeed, $P_A(\lambda) = 0$ implies that the matrix $(A - \lambda I)$ is singular.
   In particular, its kernel is not reduced to the zero vector. Conversely, if
   there exists $x \ne 0$ such that $Ax = \lambda x$, then $\lambda$ is an eigenvalue of A.

2. If A is real, there may exist complex eigenvalues of A.

3. The characteristic polynomial (and accordingly the eigenvalues) is invariant
   under basis change, since

   \[ \det(Q^{-1}AQ - \lambda I) = \det(A - \lambda I), \]

   for any invertible matrix Q.

4. There exist two distinct ways of enumerating the eigenvalues of a matrix
   of size n. Either they are denoted by $(\lambda_1,\ldots,\lambda_n)$ (i.e., we list the roots
   of the characteristic polynomial repeating a root as many times as its
   multiplicity) and we say “eigenvalues repeated with multiplicity,” or we
   denote them by $(\lambda_1,\ldots,\lambda_p)$ with $1 \le p \le n$ keeping only the distinct
   roots of the characteristic polynomial (i.e., an eigenvalue appears only
   once in this list no matter its algebraic multiplicity) and we say “distinct
   eigenvalues.”


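Remark 2.3.1. The eigenvalues of a Hermitian matrix are real. Actually, if $\lambda$ is
an eigenvalue of a Hermitian matrix A, and $u \ne 0$ a corresponding eigenvector,
we have

\[ \lambda\|u\|^2 = \langle \lambda u, u\rangle = \langle Au, u\rangle = \langle u, A^*u\rangle = \langle u, Au\rangle = \langle u, \lambda u\rangle = \overline{\lambda}\,\|u\|^2, \]

which shows that $\lambda = \overline{\lambda}$, i.e., $\lambda \in \mathbb{R}$.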


Definition 2.3.3. Let $\lambda$ be an eigenvalue of A. We call the vector subspace
defined by

\[ E_\lambda = \mathrm{Ker}\,(A - \lambda I) \]

the eigensubspace associated with the eigenvalue $\lambda$. We call the vector subspace
defined by

\[ F_\lambda = \bigcup_{k\ge 1} \mathrm{Ker}\,(A - \lambda I)^k \]

the generalized eigenspace associated with $\lambda$.


Remark 2.3.2. In the definition of the generalized eigenspace $F_\lambda$, the union of
the kernels of $(A - \lambda I)^k$ is finite, i.e., there exists an integer $k_0$ such that

\[ F_\lambda = \bigcup_{1\le k\le k_0} \mathrm{Ker}\,(A - \lambda I)^k. \]

Indeed, the sequence of vector subspaces $\mathrm{Ker}\,(A - \lambda I)^k$ is an increasing nested
sequence in a space of finite dimension. For k larger than an integer $k_0$ the
sequence of dimensions is stationary; otherwise, this would contradict the
finiteness of the dimension of the space $\mathbb{C}^n$. Consequently, for $k \ge k_0$, all
spaces $\mathrm{Ker}\,(A - \lambda I)^k$ are equal to $\mathrm{Ker}\,(A - \lambda I)^{k_0}$.


Definition 2.3.4. Let $P(X) = \sum_{i=0}^{d} a_i X^i$ be a polynomial on $\mathbb{C}$ and A a
matrix of $M_n(\mathbb{C})$. The corresponding matrix polynomial P(A) is defined as
$P(A) = \sum_{i=0}^{d} a_i A^i$.

Lemma 2.3.1. If $Ax = \lambda x$ with $x \ne 0$, then $P(A)x = P(\lambda)x$ for all polynomials
P(X). In other words, if $\lambda$ is an eigenvalue of A, then $P(\lambda)$ is an
eigenvalue of P(A).
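
Lemma 2.3.1 is easy to test on a small example. The sketch below evaluates a matrix polynomial by Horner's scheme (our choice of evaluation order; the definition only prescribes the value of P(A)) and compares P(A)x with $P(\lambda)x$ for an eigenpair chosen by hand.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def poly_of_matrix(coeffs, a):
    """P(A) = sum_i coeffs[i] * A^i, evaluated by Horner's scheme."""
    n = len(a)
    result = [[0] * n for _ in range(n)]
    for c in reversed(coeffs):
        result = matmul(result, a)
        for i in range(n):
            result[i][i] += c      # add c * I
    return result

def poly_of_scalar(coeffs, t):
    return sum(c * t ** i for i, c in enumerate(coeffs))

A = [[2, 1], [0, 3]]
x, lam = [1, 1], 3                 # A x = 3 x
coeffs = [1, 3, 1]                 # P(X) = 1 + 3X + X^2

print(matvec(poly_of_matrix(coeffs, A), x))
print([poly_of_scalar(coeffs, lam) * xi for xi in x])
```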


Theorem 2.3.1 (Cayley-Hamilton). Let $P_A(\lambda) = \det(A - \lambda I)$ be the
characteristic polynomial of A. We have

\[ P_A(A) = 0. \]


Remark 2.3.3. The Cayley-Hamilton theorem shows that the smallest possible
degree for a polynomial that vanishes at A is less than or equal to n. This
smallest degree may be strictly less than n. We call the smallest-degree
polynomial that vanishes at A and whose highest-degree term has coefficient 1 the
minimal polynomial of A.
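
For matrices of size 2 the theorem can be checked by hand, since $\det(A - \lambda I) = \lambda^2 - (\mathrm{tr}\,A)\lambda + \det A$. The following Python sketch, restricted to the 2 x 2 case and using names of our own, evaluates $A^2 - (\mathrm{tr}\,A)A + (\det A)I$ and also illustrates Remark 2.3.3 with the matrix 2I, which is already annihilated by the degree-one polynomial X - 2.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cayley_hamilton_residual_2x2(a):
    """Return A^2 - tr(A) A + det(A) I for a 2 x 2 matrix A."""
    tr = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    a2 = matmul(a, a)
    return [[a2[i][j] - tr * a[i][j] + (det if i == j else 0)
             for j in range(2)] for i in range(2)]

A = [[1, 2], [3, 4]]
print(cayley_hamilton_residual_2x2(A))      # zero matrix

S = [[2, 0], [0, 2]]                        # S = 2I
S_minus_2I = [[S[i][j] - (2 if i == j else 0) for j in range(2)] for i in range(2)]
print(S_minus_2I)                           # X - 2 already vanishes at S
```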


Theorem 2.3.2 (Spectral decomposition). Consider a matrix $A \in M_n(\mathbb{C})$
that has p distinct eigenvalues $(\lambda_1,\ldots,\lambda_p)$, with $1 \le p \le n$, of algebraic
multiplicity $n_1,\ldots,n_p$ with $1 \le n_i \le n$ and $\sum_{i=1}^{p} n_i = n$. Then its generalized
eigenspaces satisfy

\[ \mathbb{C}^n = \bigoplus_{i=1}^{p} F_{\lambda_i}, \quad F_{\lambda_i} = \mathrm{Ker}\,(A - \lambda_i I)^{n_i}, \quad\text{and}\quad \dim F_{\lambda_i} = n_i. \]

We recall that $\oplus$ denotes the direct sum of subspaces. More precisely,
$\mathbb{C}^n = \bigoplus_{i=1}^{p} F_{\lambda_i}$ means that any vector $x \in \mathbb{C}^n$ can be uniquely decomposed as

\[ x = \sum_{i=1}^{p} x^i \quad\text{with } x^i \in F_{\lambda_i}. \]

Remark 2.3.4. Theorem 2.3.2 can be interpreted as follows. Let $B_i$ be a basis
of the generalized eigenspace $F_{\lambda_i}$. The union of all $(B_i)_{1\le i\le p}$ forms a basis B
of $\mathbb{C}^n$. Let P be the change of basis matrix from the canonical basis to B.
Since each $F_{\lambda_i}$ is stable under A, we obtain a new matrix that is block
diagonal in the basis B, that is,

\[
P^{-1}AP = \begin{pmatrix}
A_1 & & 0 \\
& \ddots & \\
0 & & A_p
\end{pmatrix},
\]

where each $A_i$ is a square matrix of size $n_i$. We shall see in the next section
that by a suitable choice of the basis $B_i$, each block $A_i$ can be written as an
upper triangular matrix with the eigenvalue $\lambda_i$ on its diagonal. The Jordan
form (cf. [9], [10]) allows us to simplify further the structure of this triangular
matrix.


2.4 Matrix Triangularization 


There exist classes of particularly simple matrices. For instance, diagonal
matrices are matrices $A = (a_{i,j})_{1\le i,j\le n}$ such that $a_{i,j} = 0$ if $i \ne j$, and upper
(respectively, lower) triangular matrices are matrices such that $a_{i,j} = 0$ if
i > j (respectively, if i < j). Reducing a matrix is the process of transforming
it by a change of basis into one of these particular forms.


Definition 2.4.1. A matrix $A \in M_n(\mathbb{C})$ can be reduced to triangular form
(respectively, to diagonal form) if there exists a nonsingular matrix P and a
triangular matrix T (respectively, diagonal matrix D) such that

\[ A = PTP^{-1} \quad (\text{respectively, } A = PDP^{-1}). \]


Remark 2.4.1. The matrices A and T (or D) are similar: they correspond to
the same linear transformation expressed in two different bases, and P is the
matrix of this change of basis. More precisely, if this linear transformation
has A for its matrix in the basis $B = (e_i)_{1\le i\le n}$, and T (or D) in the basis
$B' = (f_i)_{1\le i\le n}$, then P is the matrix for passing from B to B', and we have
$P = (p_{i,j})_{1\le i,j\le n}$ with $p_{i,j} = f_j^* e_i$. Furthermore, when A can be diagonalized,
the column vectors of P are eigenvectors of A.

If A can be reduced to diagonal or triangular form, then the eigenvalues
of A, repeated with their algebraic multiplicities $(\lambda_1,\ldots,\lambda_n)$, appear on the
diagonal of D, or of T. In other words, we have

\[
D = \begin{pmatrix}
\lambda_1 & & 0 \\
& \ddots & \\
0 & & \lambda_n
\end{pmatrix}
\quad\text{or}\quad
T = \begin{pmatrix}
\lambda_1 & \times & \cdots & \times \\
0 & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \times \\
0 & \cdots & 0 & \lambda_n
\end{pmatrix}.
\]

In all cases, the characteristic polynomial of A is

\[ P(\lambda) = \det(A - \lambda I) = \prod_{i=1}^{n} (\lambda_i - \lambda). \]


Proposition 2.4.1. Any matrix $A \in M_n(\mathbb{C})$ can be reduced to triangular
form.


Proof. We proceed by induction on the dimension n. The proposition is
obviously true for n = 1. We assume that it holds up to order n - 1. For
$A \in M_n(\mathbb{C})$, its characteristic polynomial $\det(A - \lambda I)$ has at least one root
$\lambda_1 \in \mathbb{C}$ with a corresponding eigenvector $e_1 \ne 0$ such that $Ae_1 = \lambda_1 e_1$. We
complement $e_1$ with other vectors $(e_2,\ldots,e_n)$ to obtain a basis of $\mathbb{C}^n$. For
$2 \le j \le n$, there exist coefficients $\alpha_j$ and $b_{i,j}$ such that

\[ Ae_j = \alpha_j e_1 + \sum_{i=2}^{n} b_{i,j}\, e_i. \tag{2.1} \]

We denote by B the matrix of size n - 1 defined by its entries $(b_{i,j})_{2\le i,j\le n}$.
Introducing the change of basis matrix $P_1$ for passing from the canonical basis
to $(e_1,\ldots,e_n)$, identity (2.1) is equivalent to

\[
P_1^{-1}AP_1 = \begin{pmatrix}
\lambda_1 & \alpha_2 & \cdots & \alpha_n \\
0 & & & \\
\vdots & & B & \\
0 & & &
\end{pmatrix}.
\]



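Applying the induction assumption, there exists a nonsingular matrix $P_2$ of
size n - 1 such that $P_2^{-1}BP_2 = T_2$, where $T_2$ is an upper triangular matrix of
order n - 1. From $P_2$ we create a matrix $P_3$ of size n defined by

\[
P_3 = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & & & \\
\vdots & & P_2 & \\
0 & & &
\end{pmatrix}.
\]

Then, setting $P = P_1P_3$ yields

\[
P^{-1}AP = \begin{pmatrix}
\lambda_1 & \beta_2 & \cdots & \beta_n \\
0 & & & \\
\vdots & & P_2^{-1}BP_2 & \\
0 & & &
\end{pmatrix}
= \begin{pmatrix}
\lambda_1 & \beta_2 & \cdots & \beta_n \\
0 & & & \\
\vdots & & T_2 & \\
0 & & &
\end{pmatrix}
= T,
\]

where T is an upper triangular matrix and $(\beta_2,\ldots,\beta_n) = (\alpha_2,\ldots,\alpha_n)P_2$.

Remark 2.4.2. If A is real, the result still applies, but T and P may be complex!
For instance, the matrix

\[ A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \]

has complex eigenvalues and eigenvectors.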


We already know that all matrices can be reduced to triangular form. 
The purpose of the next theorem is to prove furthermore that this may be 
performed through a change of orthonormal basis. 


Theorem 2.4.1 (Schur Factorization). For any matrix $A \in M_n(\mathbb{C})$ there
exists a unitary matrix U (i.e., $U^{-1} = U^*$) such that $U^{-1}AU$ is triangular.


Proof. Let $(e_i)_{i=1}^{n}$ be the canonical basis and let $(f_i)_{i=1}^{n}$ be the basis in which
A is triangular. We call P the corresponding change of basis matrix, that is,
the matrix whose columns are the vectors $(f_i)_{i=1}^{n}$. Proposition 2.4.1 tells us
that $A = PTP^{-1}$. We apply the Gram-Schmidt orthonormalization process
(see Theorem 2.1.1) to the basis $(f_i)_{i=1}^{n}$, which yields an orthonormal basis
$(g_i)_{i=1}^{n}$ such that for any $1 \le i \le n$,

\[ \mathrm{span}\{g_1,\ldots,g_i\} = \mathrm{span}\{f_1,\ldots,f_i\}. \]

Since AP = PT with T upper triangular, keeping only the first i columns of
this equality gives

\[ \mathrm{span}\{Af_1,\ldots,Af_i\} \subset \mathrm{span}\{f_1,\ldots,f_i\}, \quad \text{for all } 1 \le i \le n. \tag{2.2} \]

Thus, we deduce that

\[ \mathrm{span}\{Ag_1,\ldots,Ag_i\} \subset \mathrm{span}\{g_1,\ldots,g_i\}. \tag{2.3} \]

Conversely, (2.3) implies that there exists an upper triangular matrix R such
that AU = UR, where U is the unitary matrix whose columns are the orthonormal
vectors $(g_i)_{i=1}^{n}$. Hence $U^{-1}AU = R$ is triangular.


2.5 Matrix Diagonalization 
Proposition 2.5.1. Let $A \in M_n(\mathbb{C})$ with distinct eigenvalues $(\lambda_1,\ldots,\lambda_p)$,
$1 \le p \le n$. The matrix A is diagonalizable if and only if

\[ \mathbb{C}^n = \bigoplus_{i=1}^{p} E_{\lambda_i}, \]

or, equivalently, if and only if $F_{\lambda_i} = E_{\lambda_i}$ for any $1 \le i \le p$.

Proof. If $\mathbb{C}^n = \bigoplus_{i=1}^{p} E_{\lambda_i}$, then A is diagonal in a basis obtained as the union
of bases for the subspaces $E_{\lambda_i}$. Conversely, if there exists a nonsingular matrix
P such that $P^{-1}AP$ is diagonal, it is clear that $\mathbb{C}^n = \bigoplus_{i=1}^{p} E_{\lambda_i}$.

What is more, we always have $E_{\lambda_i} \subset F_{\lambda_i}$ and $\mathbb{C}^n = \bigoplus_{i=1}^{p} F_{\lambda_i}$ by virtue of
Theorem 2.3.2. Hence, the identities $F_{\lambda_i} = E_{\lambda_i}$ for all $1 \le i \le p$ are equivalent
to requiring that A be diagonalizable.


In general, not every matrix is diagonalizable. Moreover, there is no simple 
characterization of the set of diagonalizable matrices. However, if we restrict 
ourselves to matrices that are diagonalizable in an orthonormal basis of eigen- 
vectors, then such matrices have an elementary characterization. Namely, the 
set of diagonalizable matrices in an orthonormal basis coincides with the set 
of normal matrices, i.e., satisfying AA* = A* A. 


Theorem 2.5.1 (Diagonalization). A matrix $A \in M_n(\mathbb{C})$ is normal (i.e.,
$AA^* = A^*A$) if and only if there exists a unitary matrix U such that

\[ A = U\,\mathrm{diag}(\lambda_1,\ldots,\lambda_n)\,U^{-1}, \]

where $(\lambda_1,\ldots,\lambda_n)$ are the eigenvalues of A.


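Remark 2.5.1. There are matrices that are diagonalizable in a nonorthonormal
basis but are not normal. For instance, the matrix

\[ A = \begin{pmatrix} -1 & 2 \\ 0 & 1 \end{pmatrix} \]

is not normal, because

\[ AA^t = \begin{pmatrix} 5 & 2 \\ 2 & 1 \end{pmatrix} \ne A^tA = \begin{pmatrix} 1 & -2 \\ -2 & 5 \end{pmatrix}. \]

Nevertheless A is diagonalizable in a basis of eigenvectors, but these vectors
are not orthogonal:

\[ A = PDP^{-1} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}. \]
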
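
The computations of Remark 2.5.1 can be reproduced in a few lines. In this sketch the names P, D and P_inv denote the three factors of the diagonalization written above, and the helper functions are ours.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

A = [[-1, 2], [0, 1]]
P = [[1, 1], [0, 1]]        # columns are eigenvectors of A
D = [[-1, 0], [0, 1]]       # eigenvalues -1 and 1
P_inv = [[1, -1], [0, 1]]

print(matmul(A, transpose(A)))             # A A^t
print(matmul(transpose(A), A))             # A^t A, different from A A^t
print(matmul(matmul(P, D), P_inv) == A)    # A = P D P^{-1}

# Scalar product of the two eigenvectors (columns of P): nonzero.
col0, col1 = transpose(P)
print(sum(x * y for x, y in zip(col0, col1)))
```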
Proof of Theorem 2.5.1. Clearly, a matrix $A = UDU^*$, with U unitary and
D diagonal, is normal. Conversely, we already know by Theorem 2.4.1 that
any matrix A can be reduced to triangular form in an orthonormal basis. In
other words, there exists a unitary matrix U and an upper triangular matrix
T such that $A = UTU^*$. Now, $AA^* = A^*A$ implies that $TT^* = T^*T$, i.e., the
matrix T is normal. Let us show that a matrix that is both triangular and
normal is diagonal. By definition we have $T = (t_{i,j})_{1\le i,j\le n}$ with $t_{i,j} = 0$ if
i > j. Identifying the entry in the first row and first column of the product
$T^*T = TT^*$, we deduce that

\[ |t_{1,1}|^2 = \sum_{k=1}^{n} |t_{1,k}|^2, \]

which yields $t_{1,k} = 0$ for all $2 \le k \le n$, i.e., the first row of T has only zero
entries, except for the diagonal entry. By induction, we assume that the first
(i - 1) rows of T have only zeros, except for the diagonal entries. Identifying
the entry in the ith row and ith column of the product $T^*T = TT^*$ yields

\[ |t_{i,i}|^2 = \sum_{k=i}^{n} |t_{i,k}|^2, \]

so that $t_{i,k} = 0$ for all $i+1 \le k \le n$, which means that the ith row of T also
has only zeros off the diagonal. Hence T is diagonal.


Remark 2.5.2. Take a matrix $A \in M_n(\mathbb{C})$ that is diagonalizable in an orthonormal
basis, i.e., $A = U\,\mathrm{diag}(\lambda_1,\ldots,\lambda_n)\,U^*$ with U unitary. Another way
of writing A is to introduce the columns $(u_i)_{1\le i\le n}$ of U (which are also the
eigenvectors of A), and to decompose $A = \sum_{i=1}^{n} \lambda_i u_i u_i^*$.


Theorem 2.5.2. A matrix $A \in M_n(\mathbb{C})$ is self-adjoint (or Hermitian, i.e.,
$A = A^*$) if and only if it is diagonalizable in an orthonormal basis with real
eigenvalues, in other words, if there exists a unitary matrix U such that

\[ A = U\,\mathrm{diag}(\lambda_1,\ldots,\lambda_n)\,U^{-1} \quad\text{with } \lambda_i \in \mathbb{R}. \]


Proof. If $A = U\,\mathrm{diag}(\lambda_1,\ldots,\lambda_n)\,U^{-1}$, then

\[ A^* = (U^{-1})^*\,\mathrm{diag}(\overline{\lambda_1},\ldots,\overline{\lambda_n})\,U^*, \]

and since U is unitary and the eigenvalues are real, we have $A = A^*$. Reciprocally,
we assume that $A = A^*$. In particular, A is normal, hence diagonalizable
in an orthonormal basis of eigenvectors. Then, according to Remark 2.3.1, its
eigenvalues are real.


We can improve the previous theorem in the case of a real symmetric 
matrix A (which is a special case of a self-adjoint matrix) by asserting that 
the unitary matrix U is also real. 


Corollary 2.5.1. A matrix $A \in \mathcal{M}_n(\mathbb{R})$ is real symmetric if and only if there
exist a real unitary matrix Q (also called orthogonal, $Q^{-1} = Q^t$) and real
eigenvalues $\lambda_1,\dots,\lambda_n \in \mathbb{R}$ such that

$$A = Q\,\mathrm{diag}(\lambda_1,\dots,\lambda_n)\,Q^{-1}.$$

A self-adjoint matrix $A \in \mathcal{M}_n(\mathbb{C})$ (i.e., $A^* = A$) is said to be positive
definite if all of its eigenvalues are strictly positive. It is said to be nonnegative
definite (or positive semidefinite) if all of its eigenvalues are nonnegative.

Theorem 2.5.3. Let A be a self-adjoint matrix and B a positive definite
self-adjoint matrix in $\mathcal{M}_n(\mathbb{C})$. There exists a basis of $\mathbb{C}^n$ that is both orthonormal
for the Hermitian product $\bar{y}^t B x$ and orthogonal for the Hermitian form $\bar{x}^t A x$.
In other words, there exists a nonsingular matrix M such that

$$B = M^*M \quad\text{and}\quad A = M^*\,\mathrm{diag}(\mu_1,\dots,\mu_n)\,M$$

with $\mu_i \in \mathbb{R}$.


2.6 Min-Max Principle


We establish a variational principle, known as the min-max principle, or the
Courant-Fisher principle, which gives the eigenvalues of a Hermitian matrix
as the result of a simple optimization process.


Definition 2.6.1. Let A be a self-adjoint or Hermitian matrix in $\mathcal{M}_n(\mathbb{C})$,
i.e., $A^* = A$. The function from $\mathbb{C}^n \setminus \{0\}$ into $\mathbb{R}$ defined by

$$R_A(x) = \frac{\langle Ax, x\rangle}{\langle x, x\rangle}$$

is called the Rayleigh quotient of A.


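By virtue of Remark 2.3.1, the Rayleigh quotient of a Hermitian matrix
is always real. Indeed, for any vector $x \in \mathbb{C}^n$, we have $\langle Ax, x\rangle = \langle x, A^*x\rangle =
\langle x, Ax\rangle = \overline{\langle Ax, x\rangle}$, so $\langle Ax, x\rangle$ equals its own conjugate and thus belongs to $\mathbb{R}$.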
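As an illustration, the realness of the Rayleigh quotient can be observed on a small Hermitian matrix and a genuinely complex vector. This is a sketch only: the matrix, the vector, and the helper names are chosen for the example, and the Hermitian product is assumed to be conjugate-linear in its second argument.

```python
def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def inner(x, y):
    """Hermitian product, taken here as sum_i x_i * conj(y_i) (an assumption)."""
    return sum(xi * yi.conjugate() for xi, yi in zip(x, y))

def rayleigh(A, x):
    """Rayleigh quotient <Ax, x> / <x, x> for a nonzero vector x."""
    return inner(matvec(A, x), x) / inner(x, x)

# Hermitian example: A equals its conjugate transpose; its eigenvalues are 1 and 4.
A = [[2 + 0j, 1 - 1j], [1 + 1j, 3 + 0j]]
x = [1 + 2j, -1 + 0.5j]
r = rayleigh(A, x)   # complex type, but the imaginary part vanishes
```

For this choice the quotient equals 3, which lies between the two eigenvalues 1 and 4, as the results of this section predict.
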


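Theorem 2.6.1. Let A be a Hermitian matrix of $\mathcal{M}_n(\mathbb{C})$. Its smallest eigenvalue,
denoted by $\lambda_1$, satisfies

$$\lambda_1 = \min_{x \in \mathbb{C}^n,\, x \neq 0} R_A(x) = \min_{x \in \mathbb{C}^n,\, \|x\| = 1} \langle Ax, x\rangle,$$

and both minima are attained for at least one eigenvector $e_1 \neq 0$ satisfying
$Ae_1 = \lambda_1 e_1$.


Remark 2.6.1. The same kind of result is true for the largest eigenvalue $\lambda_n$ of
A, namely,

$$\lambda_n = \max_{x \in \mathbb{C}^n,\, x \neq 0} R_A(x) = \max_{x \in \mathbb{C}^n,\, \|x\| = 1} \langle Ax, x\rangle,$$

and both maxima are attained for at least one eigenvector $e_n \neq 0$ satisfying
$Ae_n = \lambda_n e_n$.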


Proof of Theorem 2.6.1. Since A is Hermitian, it is diagonalizable in an
orthonormal basis of eigenvectors $(e_1,\dots,e_n)$. We call $\lambda_1 \le \cdots \le \lambda_n$ its real
eigenvalues (see Remark 2.3.1) sorted in increasing order. Let $(x_1,\dots,x_n)$ be
the coordinates of the vector x in this basis. We have

$$\langle Ax, x\rangle = \sum_{i=1}^n \lambda_i |x_i|^2 \ge \lambda_1 \sum_{i=1}^n |x_i|^2 = \lambda_1 \langle x, x\rangle.$$

As a consequence, both minima are greater than or equal to $\lambda_1$. Moreover, we have
$Ae_1 = \lambda_1 e_1$ and $\|e_1\| = 1$, by definition. Therefore, we get

$$\langle Ae_1, e_1\rangle = \lambda_1 \langle e_1, e_1\rangle,$$

so both minima are attained at this vector $e_1$.
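A quick numerical experiment makes the statement concrete. The sketch below samples the quadratic form on real unit vectors only, which is sufficient for a real symmetric matrix since its eigenvectors can be chosen real; the matrix (with eigenvalues 1 and 3) and the sampling resolution `N` are assumptions made for the example.

```python
import math

A = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 1 and 3, eigenvectors (1,-1) and (1,1)

def quad_form(A, x):
    """<Ax, x> for a real 2x2 matrix and a real vector x."""
    Ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    return Ax[0] * x[0] + Ax[1] * x[1]

N = 3600   # number of sample points on the unit circle
values = [quad_form(A, (math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N)))
          for k in range(N)]
lo, hi = min(values), max(values)   # approximate the smallest and largest eigenvalue
```

Here the quadratic form equals 2 + sin(2θ) on the unit circle, so the sampled extrema agree with the eigenvalues 1 and 3 up to rounding.
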


Theorem 2.6.1 and Remark 2.6.1 can be generalized to all other interme- 
diate eigenvalues as follows. 


Proposition 2.6.1. Let A be a Hermitian matrix with eigenvalues $\lambda_1 \le \cdots \le \lambda_n$.
For each index $i \in \{1,\dots,n\}$, we have

$$\lambda_i = \min_{x \perp \mathrm{span}\{e_1,\dots,e_{i-1}\}} R_A(x) = \max_{x \perp \mathrm{span}\{e_{i+1},\dots,e_n\}} R_A(x), \qquad (2.4)$$

where $(e_1,\dots,e_n)$ are the eigenvectors of A associated with $(\lambda_1,\dots,\lambda_n)$.


Remark 2.6.2. In formula (2.4), it should be understood that for i = 1, the
minimization is carried out without any orthogonality constraint on the vector x,
so we recover the statement of Theorem 2.6.1. Similarly, it should be understood
that for i = n, the maximization is carried out without any orthogonality
constraint on the vector x.


Proof of Proposition 2.6.1. With the previous notation, we have $x = \sum_{i=1}^n x_i e_i$.
Consequently, $x \perp \mathrm{span}\{e_1,\dots,e_{i-1}\}$ implies that $x = \sum_{j=i}^n x_j e_j$,
hence

$$\frac{\langle Ax, x\rangle}{\langle x, x\rangle} \ge \lambda_i \quad\text{and}\quad \frac{\langle Ae_i, e_i\rangle}{\langle e_i, e_i\rangle} = \lambda_i,$$

which completes the proof for the minimization (same argument for the maximum).


Now we can prove the main result of this section, namely the “min-max”
principle, also called the Courant-Fisher theorem.


Theorem 2.6.2 (Courant-Fisher or min-max). Let A be a Hermitian
matrix with eigenvalues $\lambda_1 \le \cdots \le \lambda_n$. For each index $i \in \{1,\dots,n\}$, we have

$$\lambda_i = \min_{(a_1,\dots,a_{n-i}) \in \mathbb{C}^n} \; \max_{x \perp \mathrm{span}\{a_1,\dots,a_{n-i}\}} R_A(x) \qquad (2.5)$$

$$\phantom{\lambda_i} = \max_{(a_1,\dots,a_{i-1}) \in \mathbb{C}^n} \; \min_{x \perp \mathrm{span}\{a_1,\dots,a_{i-1}\}} R_A(x). \qquad (2.6)$$


Remark 2.6.3. In the “min-max” formula of (2.5), we first start by carrying
out, for some fixed family of vectors $(a_1,\dots,a_{n-i})$ of $\mathbb{C}^n$, the maximization of
the Rayleigh quotient on the subspace orthogonal to $\mathrm{span}\{a_1,\dots,a_{n-i}\}$. The
next stage consists in minimizing the latter result as the family $(a_1,\dots,a_{n-i})$
varies in $\mathbb{C}^n$. The “max-min” formula is interpreted in a similar fashion. As
already mentioned in Remark 2.6.2, in the “max-min” formula, for i = 1,
minimization in x is done all over $\mathbb{C}^n \setminus \{0\}$, and there is no maximization on
the (empty) family $(a_1,\dots,a_{i-1})$. Hence we retrieve Theorem 2.6.1. The same
remark applies to the “min-max” formula for i = n too.


Proof of the Courant-Fisher theorem. We prove only the max-min formula
(the min-max formula is proved in the same way). First of all, for the particular
choice of the family $(e_1,\dots,e_{i-1})$, we have

$$\max_{(a_1,\dots,a_{i-1}) \in \mathbb{C}^n} \; \min_{x \perp \mathrm{span}\{a_1,\dots,a_{i-1}\}} R_A(x) \ge \min_{x \perp \mathrm{span}\{e_1,\dots,e_{i-1}\}} R_A(x) = \lambda_i.$$

Furthermore, for all choices of $(a_1,\dots,a_{i-1})$, we have

$$\dim \mathrm{span}\{a_1,\dots,a_{i-1}\}^{\perp} \ge n - i + 1.$$

On the other hand, $\dim \mathrm{span}\{e_1,\dots,e_i\} = i$, so the subspace

$$\mathrm{span}\{a_1,\dots,a_{i-1}\}^{\perp} \cap \mathrm{span}\{e_1,\dots,e_i\}$$

is not reduced to $\{0\}$, since its dimension is at least $(n-i+1) + i - n = 1$. Therefore, we can
restrict the minimization space to obtain an upper bound:

$$\begin{aligned}
\min_{x \perp \mathrm{span}\{a_1,\dots,a_{i-1}\}} R_A(x)
&\le \min_{x \in \mathrm{span}\{a_1,\dots,a_{i-1}\}^{\perp} \cap\, \mathrm{span}\{e_1,\dots,e_i\}} R_A(x) \\
&\le \max_{x \in \mathrm{span}\{a_1,\dots,a_{i-1}\}^{\perp} \cap\, \mathrm{span}\{e_1,\dots,e_i\}} R_A(x) \\
&\le \max_{x \in \mathrm{span}\{e_1,\dots,e_i\}} R_A(x) = \lambda_i,
\end{aligned}$$

where the last inequality is a consequence of the fact that enlarging the
maximization space does not decrease the value of the maximum. Taking the
maximum with respect to all families $(a_1,\dots,a_{i-1})$ in the above equation, we
conclude that

$$\max_{(a_1,\dots,a_{i-1}) \in \mathbb{C}^n} \; \min_{x \perp \mathrm{span}\{a_1,\dots,a_{i-1}\}} R_A(x) \le \lambda_i,$$

which ends the proof.
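The max-min formula can be explored numerically on a diagonal example. The sketch below takes A = diag(1, 2, 3) and i = 2, so a single real vector a is varied; for each a it computes the exact minimum of the Rayleigh quotient on the plane orthogonal to a, through the 2 × 2 matrix of A restricted to that plane. The matrix, the random sampling, and all helper names are assumptions of this illustration.

```python
import math
import random

LAMBDA = [1.0, 2.0, 3.0]   # A = diag(1, 2, 3), so lambda_2 = 2

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def normalize(x):
    n = math.sqrt(dot(x, x))
    return [xi / n for xi in x]

def cross(x, y):
    return [x[1] * y[2] - x[2] * y[1], x[2] * y[0] - x[0] * y[2], x[0] * y[1] - x[1] * y[0]]

def min_rayleigh_orthogonal_to(a):
    """Minimum of R_A over nonzero real x orthogonal to a, for A = diag(LAMBDA)."""
    a = normalize(a)
    k = min(range(3), key=lambda i: abs(a[i]))       # coordinate axis least aligned with a
    e = [1.0 if i == k else 0.0 for i in range(3)]
    b1 = normalize([e[i] - dot(e, a) * a[i] for i in range(3)])
    b2 = cross(a, b1)                                 # (b1, b2): orthonormal basis of the plane
    Ab1 = [LAMBDA[i] * b1[i] for i in range(3)]
    Ab2 = [LAMBDA[i] * b2[i] for i in range(3)]
    m11, m12, m22 = dot(Ab1, b1), dot(Ab1, b2), dot(Ab2, b2)
    # smallest eigenvalue of the symmetric 2x2 matrix [[m11, m12], [m12, m22]]
    return 0.5 * (m11 + m22) - math.sqrt((0.5 * (m11 - m22)) ** 2 + m12 ** 2)

random.seed(0)
trials = [min_rayleigh_orthogonal_to([random.gauss(0, 1) for _ in range(3)])
          for _ in range(2000)]
best_random = max(trials)                               # stays below lambda_2, up to rounding
at_e1 = min_rayleigh_orthogonal_to([1.0, 0.0, 0.0])     # the choice a = e_1 attains lambda_2
```

No random direction a gives an inner minimum above 2, while the particular choice a = e_1 reaches it, in line with the two halves of the proof.
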


2.7 Singular Values of a Matrix 


Throughout this section, we consider matrices in $\mathcal{M}_{m,n}(\mathbb{C})$ that are not necessarily
square. To define the singular values of a matrix, we first need a
technical lemma.

Lemma 2.7.1. For any $A \in \mathcal{M}_{m,n}(\mathbb{C})$, the matrix $A^*A$ is Hermitian and has
real, nonnegative eigenvalues.


Proof. Obviously $A^*A$ is a square Hermitian matrix of size n. We deduce
from Remark 2.3.1 that its eigenvalues are real. It remains to show that they
are nonnegative. Let $\lambda$ be an eigenvalue of $A^*A$, and let $x \neq 0$ be a corresponding
eigenvector such that $A^*Ax = \lambda x$. Taking the Hermitian product of
this equality with x, we obtain

$$\lambda = \frac{\langle A^*Ax, x\rangle}{\langle x, x\rangle} = \frac{\langle Ax, Ax\rangle}{\langle x, x\rangle} = \frac{\|Ax\|^2}{\|x\|^2} \in \mathbb{R}^+,$$

and the result is proved.


Definition 2.7.1. The singular values of a matrix $A \in \mathcal{M}_{m,n}(\mathbb{C})$ are the
nonnegative square roots of the n eigenvalues of $A^*A$.


This definition makes sense thanks to Lemma 2.7.1, which proves that the
eigenvalues of $A^*A$ are real nonnegative, so their square roots are real. The
next lemma shows that the singular values of A could equally be defined as
the nonnegative square roots of the m eigenvalues of $AA^*$, since both matrices
$A^*A$ and $AA^*$ share the same nonzero eigenvalues.


Lemma 2.7.2. Let $A \in \mathcal{M}_{m,n}(\mathbb{C})$ and $B \in \mathcal{M}_{n,m}(\mathbb{C})$. The nonzero eigenvalues
of the matrices AB and BA are the same.


Proof. Take $\lambda$ an eigenvalue of AB, and $u \in \mathbb{C}^m$ a corresponding nonzero
eigenvector such that $ABu = \lambda u$. If $u \in \mathrm{Ker}(B)$, we deduce that $\lambda u = 0$, and
since $u \neq 0$, $\lambda = 0$. Consequently, if $\lambda \neq 0$, u does not belong to $\mathrm{Ker}(B)$.
Multiplying the equality by B, we obtain $BA(Bu) = \lambda(Bu)$, where $Bu \neq 0$,
which proves that $\lambda$ is also an eigenvalue of BA.
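The lemma is easy to observe on rectangular factors. In the sketch below (an illustration with matrices chosen for the purpose, m = 2 and n = 3), the two eigenvalues of the 2 × 2 product AB are computed from its trace and determinant, and each is then checked to be a root of the characteristic polynomial of the 3 × 3 product BA; the extra eigenvalue of BA is 0.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

A = [[1, 0, 2], [0, 1, 1]]        # 2 x 3
B = [[1, 1], [0, 2], [1, 0]]      # 3 x 2
AB = matmul(A, B)                 # 2 x 2
BA = matmul(B, A)                 # 3 x 3, of rank at most 2

# Eigenvalues of the 2x2 matrix AB from its trace and determinant.
tr = AB[0][0] + AB[1][1]
dt = AB[0][0] * AB[1][1] - AB[0][1] * AB[1][0]
disc = math.sqrt(tr * tr - 4 * dt)
eig_AB = [(tr - disc) / 2, (tr + disc) / 2]

def char_BA(t):
    """det(BA - t I): vanishes exactly at the eigenvalues of BA."""
    shifted = [[BA[i][j] - (t if i == j else 0.0) for j in range(3)] for i in range(3)]
    return det3(shifted)

residuals = [char_BA(t) for t in eig_AB]   # both close to zero
```
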


Remark 2.7.1. A square matrix is nonsingular if and only if its singular values 
are positive. Clearly, if A is nonsingular, so is A*A, and thus its eigenvalues 
(the squared singular values of A) are nonzero. Reciprocally, if A is singular, 
there exists a nonzero vector u such that Au = 0. For this same vector, we 
have A* Au = 0, and therefore A*A is singular too. 


Owing to Theorem 2.5.1, we can characterize the singular values of a nor- 
mal matrix. 


Proposition 2.7.1. The singular values of a normal matrix are the moduli 
of its eigenvalues. 


Proof. Indeed, if a matrix A is normal, there exists a unitary matrix U such
that $A = U^*DU$ with $D = \mathrm{diag}(\lambda_i)$, and so $A^*A = (U^*DU)^*(U^*DU) = U^*(D^*D)U$.
We deduce that the matrices $A^*A$ and $D^*D = \mathrm{diag}(|\lambda_i|^2)$ are
similar, and have accordingly the same eigenvalues.

Remark 2.7.2. As a consequence of Proposition 2.7.1, the spectral radius of a
normal matrix is equal to its largest singular value.


We finish this chapter by introducing the “singular value decomposition” 
of a matrix, in short the SVD factorization. 


Theorem 2.7.1 (SVD factorization). Let $A \in \mathcal{M}_{m,n}(\mathbb{C})$ be a matrix having
r positive singular values. There exist two unitary matrices $U \in \mathcal{M}_n(\mathbb{C})$,
$V \in \mathcal{M}_m(\mathbb{C})$, and a diagonal matrix $\tilde\Sigma \in \mathcal{M}_{m,n}(\mathbb{R})$ such that

$$A = V \tilde\Sigma U^* \quad\text{and}\quad \tilde\Sigma = \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix}, \qquad (2.7)$$

where $\Sigma = \mathrm{diag}(\mu_1,\dots,\mu_r)$, and $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > 0$ are the positive
singular values of A.


Before proving this result, let us make some remarks. Without loss of generality
we shall assume that $m \ge n$, since for $m < n$ we can apply the SVD
factorization to the matrix $A^*$ and deduce the result for A by taking the
adjoint of (2.7).


Remark 2.7.3. For n = m, the SVD factorization of Theorem 2.7.1 has nothing
to do with the usual diagonalization of a matrix because, in general, V is
different from U so $U^*$ is different from $V^{-1}$ (see Definition 2.4.1). In other
words, Theorem 2.7.1 involves two unitary changes of basis (associated with
U and V) while Definition 2.4.1 relies on a single change of basis.


Remark 2.7.4. As a byproduct of Theorem 2.7.1, we obtain that the rank of
A is equal to r, i.e., the number of nonzero singular values of A. In particular,
it satisfies $r \le \min(m, n)$.


Remark 2.7.5. We have $A^*A = U \tilde\Sigma^t \tilde\Sigma U^*$. The columns of matrix U are thus
the eigenvectors of the Hermitian matrix $A^*A$, and the diagonal entries of
$\tilde\Sigma^t \tilde\Sigma \in \mathcal{M}_n(\mathbb{R})$ are the eigenvalues of $A^*A$, i.e., the squares of the singular
values of A. On the other hand, $AA^* = V \tilde\Sigma \tilde\Sigma^t V^*$, so the columns of V are
the eigenvectors of the other Hermitian matrix $AA^*$ and the diagonal entries
of $\tilde\Sigma \tilde\Sigma^t \in \mathcal{M}_m(\mathbb{R})$ are the eigenvalues of $AA^*$ too.


Remark 2.7.6. Theorem 2.7.1 can be refined in the case of a real matrix A, by 
showing that both matrices U and V are real too. 


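Proof of Theorem 2.7.1. We denote by $u_i$ the eigenvectors of $A^*A$ corresponding
to the eigenvalues $\mu_i^2$ (see Lemma 2.7.1), $A^*Au_i = \mu_i^2 u_i$, and U is
the unitary matrix defined by $U = [u_1|\dots|u_n]$. We have

$$A^*AU = [A^*Au_1|\dots|A^*Au_n] = [\mu_1^2 u_1|\dots|\mu_n^2 u_n] = U\,\mathrm{diag}(\mu_1^2,\dots,\mu_n^2),$$

so $U^*A^*AU = \tilde\Sigma^t \tilde\Sigma$, setting

$$\tilde\Sigma = \begin{pmatrix} \mu_1 & 0 & \cdots & 0 \\ 0 & \mu_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \mu_n \\ 0 & \cdots & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & \cdots & \cdots & 0 \end{pmatrix} \in \mathcal{M}_{m,n}(\mathbb{R}).$$

We arrange in decreasing order the singular values $\mu_1 \ge \cdots \ge \mu_r > \mu_{r+1} = \cdots = \mu_n = 0$,
of which only the first r are nonzero. We also notice that

$$\langle Au_i, Au_j\rangle = \langle A^*Au_i, u_j\rangle = \mu_i^2 \langle u_i, u_j\rangle = \mu_i^2 \delta_{i,j},$$

and in particular, $Au_i = 0$ if $r < i \le n$. For $1 \le i \le r$, $\mu_i \neq 0$, so we can define
unit vectors $v_i \in \mathbb{C}^m$ by $v_i = Au_i/\mu_i$. These vectors, complemented in order to
obtain an orthonormal basis $v_1,\dots,v_m$ of $\mathbb{C}^m$, yield a matrix $V = [v_1|\dots|v_m]$.
Let us check equality (2.7):

$$V\tilde\Sigma U^* = [v_1|\dots|v_m]\,\tilde\Sigma U^* = [\mu_1 v_1|\dots|\mu_n v_n]\,U^* = [Au_1|\dots|Au_r|0|\dots|0]\,U^*.$$

Since $Au_i = 0$ for $r < i \le n$, we deduce $V\tilde\Sigma U^* = AUU^* = A$.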
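The construction used in the proof can be followed step by step on a small real matrix. The sketch below is an illustration under simplifying assumptions: A is a nonsingular real 2 × 2 matrix whose Gram matrix A^tA has a nonzero off-diagonal entry, so the eigenvectors of A^tA can be written in closed form. The example matrix and the helper names are chosen for the illustration.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

A = [[3.0, 0.0], [4.0, 5.0]]
G = matmul(transpose(A), A)              # A^t A, symmetric
p, q, r = G[0][0], G[0][1], G[1][1]      # this sketch assumes q != 0 and A nonsingular
half_gap = math.sqrt((0.5 * (p - r)) ** 2 + q * q)
eigs = [0.5 * (p + r) + half_gap, 0.5 * (p + r) - half_gap]   # decreasing order
mu = [math.sqrt(lam) for lam in eigs]    # singular values mu_1 >= mu_2 > 0

U_cols, V_cols = [], []
for lam, m in zip(eigs, mu):
    u = [q, lam - p]                     # eigenvector of A^t A for lam (valid since q != 0)
    n = math.hypot(u[0], u[1])
    u = [u[0] / n, u[1] / n]
    Au = [A[0][0] * u[0] + A[0][1] * u[1], A[1][0] * u[0] + A[1][1] * u[1]]
    U_cols.append(u)
    V_cols.append([Au[0] / m, Au[1] / m])   # v_i = A u_i / mu_i, as in the proof

U = transpose(U_cols)                    # columns u_1, u_2
V = transpose(V_cols)                    # columns v_1, v_2
Sigma = [[mu[0], 0.0], [0.0, mu[1]]]
A_rebuilt = matmul(matmul(V, Sigma), transpose(U))   # should reproduce A
```

For this matrix the singular values are √45 and √5, and the columns of V come out orthonormal without any extra orthogonalization, exactly as the identity for ⟨Au_i, Au_j⟩ in the proof predicts.
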


Geometrical interpretation. The SVD factorization shows that the image
of the unit sphere $S^{n-1}$ by a nonsingular matrix A is an ellipsoid. For instance,
Figure 2.2 displays the image of the unit circle of $\mathbb{R}^2$ by the 2 × 2 matrix of
Exercise 2.25. We recall that $S^{n-1} = \{(x_1,\dots,x_n)^t \in \mathbb{R}^n,\ \sum_i x_i^2 = 1\}$.
Let A be a nonsingular matrix of $\mathcal{M}_n(\mathbb{R})$ whose SVD factorization is $A = V\Sigma U^t$.
We wish to characterize the set $V\Sigma U^t S^{n-1}$. Since the matrix $U^t$
is orthogonal, it transforms any orthonormal basis into another orthonormal
basis. Hence $U^t S^{n-1} = S^{n-1}$. Then we clearly have
$\Sigma S^{n-1} = \{(x'_1,\dots,x'_n)^t \in \mathbb{R}^n,\ \sum_i (x'_i/\mu_i)^2 = 1\}$,
which is precisely the definition of an ellipsoid $E^{n-1}$
of semiaxes $\mu_i e_i$. Finally, since V is still a matrix of change of orthonormal
basis, $AS^{n-1} = VE^{n-1}$ is just a rotation of $E^{n-1}$. To sum up, $AS^{n-1}$ is an
ellipsoid, of semiaxes $\mu_i v_i$, where $v_i$ is the ith column of V.

Pseudoinverse. The SVD factorization allows us to introduce the so-called 


pseudoinverse of a matrix, which generalizes the notion of inverse for rectan- 
gular matrices or square matrices that are singular. 


Definition 2.7.2. Let $A = V\tilde\Sigma U^*$ be the SVD factorization of some matrix
$A \in \mathcal{M}_{m,n}(\mathbb{C})$ having r nonzero singular values. We call the matrix
$A^\dagger \in \mathcal{M}_{n,m}(\mathbb{C})$ defined by $A^\dagger = U \tilde\Sigma^\dagger V^*$ with

$$\tilde\Sigma^\dagger = \begin{pmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix} \in \mathcal{M}_{n,m}(\mathbb{R})$$

the pseudoinverse matrix of A.

Fig. 2.2. Image of the unit circle of the plane by the matrix defined in Exercise 2.25.


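The reader will easily check the following relations:

$$A^\dagger A = U \tilde\Sigma^\dagger \tilde\Sigma U^* = U \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix} U^* = \sum_{i=1}^r u_i u_i^*, \qquad (2.8)$$

$$A A^\dagger = V \tilde\Sigma \tilde\Sigma^\dagger V^* = V \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix} V^* = \sum_{i=1}^r v_i v_i^*, \qquad (2.9)$$

$$A = \sum_{i=1}^r \mu_i v_i u_i^*, \quad\text{and}\quad A^\dagger = \sum_{i=1}^r \frac{1}{\mu_i} u_i v_i^*. \qquad (2.10)$$

In addition, if A has maximal rank ($r = n \le m$), its pseudoinverse is given by

$$A^\dagger = (A^*A)^{-1}A^*.$$

In particular, if A is a nonsingular square matrix ($r = n = m$), we obtain
$A^\dagger A = AA^\dagger = I_n$. Thus $A^\dagger = A^{-1}$. In this sense, the pseudoinverse is indeed
the generalization of the inverse.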
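The maximal-rank formula lends itself to an exact check with rational arithmetic. The sketch below, an illustration on a 3 × 2 real matrix of rank 2 chosen for the purpose, forms X = (A^tA)^{-1}A^t and tests the four Moore-Penrose conditions AXA = A, XAX = X, and the symmetry of XA and AX (for real matrices the adjoint is the transpose). Helper names are illustrative.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def inv2(M):
    """Inverse of a nonsingular 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

# 3 x 2 matrix of rank 2 (r = n = 2 <= m = 3), with exact rational entries
A = [[Fraction(v) for v in row] for row in [[1, 0], [0, 1], [1, 1]]]
At = transpose(A)
X = matmul(inv2(matmul(At, A)), At)   # candidate pseudoinverse (A^t A)^{-1} A^t

conditions = {
    "AXA = A": matmul(matmul(A, X), A) == A,
    "XAX = X": matmul(matmul(X, A), X) == X,
    "XA symmetric": matmul(X, A) == transpose(matmul(X, A)),
    "AX symmetric": matmul(A, X) == transpose(matmul(A, X)),
}
```

Here XA is the 2 × 2 identity while AX is only an orthogonal projection of rank 2 in dimension 3, which reflects the asymmetry between (2.8) and (2.9) when m > n.
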


Remark 2.7.7. It can be shown that the pseudoinverse matrix is the only matrix
X satisfying the following conditions (known as the Moore-Penrose conditions):

$$AXA = A, \quad XAX = X, \quad XA = (XA)^*, \quad AX = (AX)^*.$$


2.8 Exercises


2.1 (*). We fix the dimension $n \ge 2$.

1. What is the vector u in terms of the matrix a defined by the instructions
   a=eye(n,n);u=a(:,i) for an integer i between 1 and n?

2. We recall that the Matlab instruction rand(m,n) returns a matrix of size
   m × n whose entries are random real numbers in the range [0,1]. For
   n fixed, we define two vectors u=rand(n,1) and v=rand(n,1). Compute
   using Matlab the vector $w = v - \frac{\langle v, u\rangle}{\|u\|_2^2}\, u$ and the scalar product $\langle w, u\rangle$.

3. Let A be a real square matrix (initialized by rand), and define two matrices
   B and C by B=0.5*(A+A') and C=0.5*(A-A').

   (a) Compute the scalar product $\langle Cx, x\rangle$ for various vectors x. Justify the
       observed result.

   (b) Compute the scalar product $\langle Bx, x\rangle$ for various vectors x. Check that
       it is equal to $\langle Ax, x\rangle$ and explain why.

2.2 (*). The goal of this exercise is to define Matlab functions returning
matrices having special properties that we shall exploit in the upcoming exercises.
All variables will be initialized by rand. Note that there are several possible
answers, as is often the case for computer programs.

1. Write a function (called SymmetricMat(n)) returning a real symmetric
   matrix of size n × n.

2. Write a function (called NonsingularMat(n)) returning a real nonsingular
   matrix of size n × n.

3. Write a function (called LowNonsingularMat(n)) returning a real nonsingular
   lower triangular matrix of size n × n.

4. Write a function (called UpNonsingularMat(n)) returning a real nonsingular
   upper triangular matrix of size n × n.

5. Write a function (called ChanceMat(m,n,p)) returning a real matrix of
   size m × n whose entries are chosen randomly between the values −p and p.

6. Write a function (called BinChanceMat(m,n)) returning a real matrix of
   size m × n whose entries are chosen randomly equal to 0 or 1.

7. Write a function (called HilbertMat(m,n)) returning the so-called Hilbert
   matrix $H \in \mathcal{M}_{m,n}(\mathbb{R})$ defined by its entries:

$$H_{i,j} = \frac{1}{i+j-1}.$$


2.3. Define a matrix A by the Matlab instructions 


p=NonsingularMat (n) ;A=p*diag([ones(n-1,1); e])*inv(p), 


where e is a real number. What is the determinant of A? (Do not use Matlab 
to answer.) Take e = 107%, n = 5, and compute by Matlab the determinant 
of A. What do you notice?


2.4. Set A=[1:3; 4:6; 7:9; 10:12; 13:15]. What is the rank of A? (use
the function rank). Let B and C be two nonsingular matrices respectively
of size 3 × 3 and 5 × 5 given by the function NonsingularMat defined in
Exercise 2.2. Compare the ranks of the products CA and AB. Justify your
experimental results.


2.5. Vary n from 1 to 10 and 


1. determine the rank of a matrix A defined by A=rand(8,n)*rand(n,6). 
What is going on? 

2. Same question for A=BinChanceMat (8,n)*BinChanceMat (n,6). 

3. Justify your observations. 


2.6. We fix the dimension n = 5. 


1. For any integer r between 1 and 5, initialize (with rand) r vectors $u_i$ and
define a square matrix $A = \sum_{i=1}^r u_i u_i^t$. Compare the rank of A with r.

2. Same question for vectors generated by BinChanceMat. 

3. Justify your observations. 


2.7 (*). Write a function (called MatRank(m,n,r)) returning a real matrix of
size m × n and of fixed rank r.


2.8. Define rectangular matrices
A=[1:3;4:6;7:9;10:12] and B=[-1 2 3;4:6;7:9;10:12].


1. Are the square matrices $A^tA$ and $B^tB$ nonsingular?
2. Determine the rank of each of the two matrices.

3. Justify the answers to the first question.

4. Same questions for $A^t$ and $B^t$.


2.9. Let A be a matrix defined by A=MatRank(n,n,r) with $r < n$ and Q a
matrix defined by Q=null(A'), that is, a matrix whose columns form a basis
of the null space of $A^t$. Let u be a column of Q; compute the rank of $A + uu^t$.
Prove the observed result.


2.10 (*). Let $a_1,\dots,a_n$ be a family of n vectors of $\mathbb{R}^m$, and $A \in \mathcal{M}_{m,n}(\mathbb{R})$ the
matrix whose columns are the vectors $(a_j)_{1\le j\le n}$. We denote by r the rank of A.
The goal is to write a program that delivers an orthonormal family of r vectors
$u_1,\dots,u_r$ of $\mathbb{R}^m$ by applying the Gram-Schmidt orthogonalization procedure
to A. We consider the following algorithm, written here in pseudolanguage
(see Chapter 4):

For p = 1 ↗ n
    s = 0
    For k = 1 ↗ p − 1
        s = s + ⟨a_p, u_k⟩ u_k
    End
    s = a_p − s
    If ‖s‖ ≠ 0 then
        u_p = s/‖s‖
    Else
        u_p = 0
    End
End
Gram-Schmidt Algorithm.


1. Determine the computational complexity (number of multiplications and
divisions for n large) of this algorithm.

2. Write a program GramSchmidt whose input argument is A and whose
output argument is a matrix the pth column of which is the vector $u_p$,
if it exists, and zero otherwise. Test this program with $A \in \mathcal{M}_{5,6}(\mathbb{R})$
defined by
n=5;u=1:n; u=u'; c2=cos(2*u); c=cos(u); s=sin(u);

A=[u c2 ones(n,1) rand()*c.*c exp(u) s.*s];

We denote by U the matrix obtained by applying the orthonormalization
procedure to A.

(a) Compute $UU^t$ and $U^tU$. Comment.

(b) Apply the GramSchmidt algorithm to U. What do you notice?

3. Change the program GramSchmidt into a program GramSchmidt1 that
returns a matrix whose first r columns are the r vectors $u_k$, and whose
last n − r columns are zero.


2.11 (*). The goal of this exercise is to study a modified Gram-Schmidt
algorithm. With the notation of the previous exercise, each time a new vector
$u_p$ is found, we may subtract from each $a_k$ ($k > p$) its component along $u_p$.
Assume now that the vectors $a_k$ are linearly independent in $\mathbb{R}^m$.


• Set $u_1 = a_1/\|a_1\|$ and replace each $a_k$ with $a_k - \langle u_1, a_k\rangle u_1$ for all $k > 1$.
• If the first p − 1 vectors $u_1,\dots,u_{p-1}$ are known, set $u_p = a_p/\|a_p\|$ and
replace each $a_k$ with $a_k - \langle u_p, a_k\rangle u_p$ for all $k > p$.


1. Write a function MGramSchmidt coding this algorithm.

2. Fix m = n = 10. Compare both the Gram-Schmidt and modified Gram-Schmidt
algorithms (by checking the orthonormality of the vectors) for a
matrix whose entries are chosen by the function rand, then for a Hilbert
matrix $H \in \mathcal{M}_{m,n}(\mathbb{R})$.

3. If we have at our disposal only the Gram-Schmidt algorithm, explain how
to improve the computation of $u_p$.


2.12 (*). The goal of the following exercises is to compare different definitions
of matrices. We use the Matlab functions tic and toc to estimate the running
time of Matlab. The function toc returns the time elapsed since the last call
to tic. Run the following instructions
n=400;tic;
for j=1:n for i=1:n,a(i,j)=cos(i)*sin(j);end;end;t1=toc;clear a;


tic;a=zeros(n,n);
for j=1:n for i=1:n, a(i,j)=cos(i)*sin(j);end;end;t2=toc;clear a;


tic;a=zeros(n,n);
for i=1:n for j=1:n, a(i,j)=cos(i)*sin(j);end;end;t3=toc;clear a;


tic;a=zeros(n,n);a=cos(1:n)'*sin(1:n);t4=toc;
Display the variables t1, t2, t3, and t4. Explain.


2.13. Define a matrix A of size n x n (vary n) by the instructions 
A=rand(n,n) ;A=triu(A)-diag(diag(A)). What is the purpose of triu? 
Compute the powers of A. Justify the observed result. 


2.14. Let H be the Hilbert matrix of size 6 x 6. 


1. Compute the eigenvalues $\lambda$ of H (use the function eig).
2. Compute the eigenvectors u of H.
3. Verify the relations $Hu = \lambda u$.


2.15. Define a matrix A by the instructions
P=[1 2 2 1; 2 3 3 2; -1 1 2 -2; 1 3 2 1];
D=[2 1 0 0; 0 2 1 0; 0 0 3 0; 0 0 0 4];
A=P*D*inv(P);


1. Without using Matlab, give the eigenvalues of A.

2. Compute the eigenvalues by Matlab. What do you notice?

3. For n = 3 and n = 10, compute the eigenvalues of $A^n$ with Matlab, and
compare with their exact values. What do you observe?

4. Diagonalize A using the function eig. Comment.


2.16. For various values of n, compare the spectra of the matrices A and $A^t$
with A=rand(n,n). Justify the answer.


2.17. Fix the dimension n. For u and v two vectors of $\mathbb{R}^n$ chosen randomly
by rand, determine the spectrum of $I_n + uv^t$. What are your experimental
observations? Rigorously prove the observed result.


2.18 (*). Define a matrix A=[10 2;2 4]. Plot the curve of the Rayleigh quotient
$x \mapsto x^t A x$, where x spans the unit circle of the plane. What are the
maximal and minimal attained values? Compare with the eigenvalues of the
matrix A. Explain.

Hint: use the Matlab function plot3. 


2.19. For various values of the integers m and n, compare the spectra of $AA^t$
and $A^tA$, where A=rand(n,m). Justify the observations.


2.20 (*). Show that the following function PdSMat returns a positive definite
symmetric matrix of size n × n

function A=PdSMat(n)
A=SymmetricMat(n); % defined in Exercise 2.2
[P,D]=eig(A);D=abs(D);
D=D+norm(D)*eye(size(D));
A=P*D*inv(P);


1. For different values of n, compute the determinant of A=PdSMat(n). What
do you observe? Justify.

2. Fix n = 10. For k varying from 1 to n, define a matrix $A_k$ of size k × k
by Ak = A(1:k,1:k). Check that the determinants of all the matrices $A_k$
are positive. Prove this result.

3. Are the eigenvalues of $A_k$ eigenvalues of A?


2.21. 


1. Let A=SymmetricMat(n). Compute the eigenvectors $u_i$ and the eigenvalues
$\lambda_i$ of A. Compute $\sum_{i=1}^n \lambda_i u_i u_i^t$ and compare this matrix with A. What
do you observe?

2. Denote by D and P the matrices defined by [P,D]=eig(A). We modify
an entry of the diagonal matrix D by setting D(1,2) = 1, and we define a
matrix B by B=P*D*inv(P). Compute the eigenvectors $v_i$ and eigenvalues
$\mu_i$ of B. Compute $\sum_{i=1}^n \mu_i v_i v_i^t$ and compare this matrix with B. What
do you observe?

3. Justify. 


2.22. For various values of n, compute the rank of the matrix defined by the 
instruction rank(rand(n,1)*rand(1,n)). What do you notice? The goal is 
to prove the observed result. 


1. Let u and v be two nonzero vectors of $\mathbb{R}^n$. What is the rank of the matrix
$A = vu^t$?

2. Let $A \in \mathcal{M}_n(\mathbb{R})$ be a rank-one matrix. Show that there exist two vectors
u and v of $\mathbb{R}^n$ such that $A = vu^t$.


2.23 (*). Define a square matrix by A=rand(n,n).


1. Compute the spectrum of A.
2. For $1 \le i \le n$, define $\gamma_i = \sum_{j \neq i} |a_{i,j}|$ and denote by $D_i$ the (so-called
Gershgorin) disk of radius $\gamma_i$ and center $a_{i,i}$:

$$D_i = \{z \in \mathbb{C},\ |z - a_{i,i}| \le \gamma_i\}.$$

(a) Compute the $\gamma_i$'s with the Matlab function sum.

(b) Let $\lambda$ be an eigenvalue of A. Check with Matlab that there exists (at
least) one index i such that $\lambda \in D_i$.
(c) Rigorously prove this result.
3. A matrix A is said to be strictly diagonally dominant if

$$|a_{i,i}| > \sum_{j \neq i} |a_{i,j}| \quad (1 \le i \le n).$$

(a) Write a function DiagDomMat(n) returning an n × n diagonally dominant
matrix.
(b) For various values of n, compute the determinant of A=DiagDomMat(n).
What do you notice?

(c) Justify your answer.
4. Write a program PlotGersh that plots the Gershgorin disks for a given
matrix. Application:

$$A = \begin{pmatrix} 1 & 0 & 1 \\ -2 & 6 & 1 \\ 1 & -1 & -3 \end{pmatrix}.$$
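2.24. For each of the matrices

$$A_1 = \begin{pmatrix} 1 & 2 & 8 \\ 3 & 2 & 1 \\ 4 & 2 & 1 \end{pmatrix}, \quad A_2 = \begin{pmatrix} .75 & 0. & .25 \\ 0. & 1. & 0. \\ .25 & 0. & .75 \end{pmatrix},$$

$$A_3 = \begin{pmatrix} .375 & 0 & -.125 \\ 0 & .5 & 0 \\ -.125 & 0 & .375 \end{pmatrix}, \quad A_4 = \begin{pmatrix} -.25 & 0. & -.75 \\ 0. & 1. & 0. \\ -.75 & 0. & -.25 \end{pmatrix},$$

compute $A_i^n$ for n = 1, 2, 3, ... In your opinion, what is the limit of $A_i^n$ as n
goes to infinity? Justify the observed results.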


2.25. Plot the image of the unit circle of R? by the matrix 


-1.25 0.75 
aa ( 0.75 He) om 


to reproduce Figure 2.2. Use the Matlab function svd. 


2.26. For different choices of m and n, compare the singular values of a matrix
A=rand(m,n) and the eigenvalues of the block matrix $B = \begin{pmatrix} 0 & A \\ A^t & 0 \end{pmatrix}$. Justify.


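2.27. Compute the pseudoinverse $A^\dagger$ (function pinv) of the matrix

$$A = \begin{pmatrix} 1 & -1 & 4 \\ 2 & -2 & 0 \\ 3 & -3 & 5 \\ -1 & -1 & 0 \end{pmatrix}.$$

Compute $A^\dagger A$, $AA^\dagger$, $AA^\dagger A$, and $A^\dagger AA^\dagger$. What do you observe? Justify.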




2.28. Fix n = 100. For different values of $r \le n$, compare the rank of
A=MatRank(n,n,r) and the trace of $AA^\dagger$. Justify.


2.29 (*). The goal of this exercise is to investigate another definition of
the pseudoinverse matrix. Fix m = 10, n = 7. Let A be a matrix defined
by A=MatRank(m,n,5). We denote by P the orthogonal projection onto
$(\mathrm{Ker}\,A)^\perp$, and by Q the orthogonal projection onto $\mathrm{Im}\,A$.


1. Compute a basis of $(\mathrm{Ker}\,A)^\perp$, then the matrix P.

2. Compute a basis of $\mathrm{Im}\,A$, then the matrix Q.

3. Compare on the one hand $A^\dagger A$ with P, and on the other hand, $AA^\dagger$ with
Q. What do you notice? Justify your answer.

4. Let $y \in \mathbb{C}^m$ and define $x_1 = Px$, where $x \in \mathbb{C}^n$ is such that $Ax = Qy$.
Prove (without using Matlab) that there exists a unique such $x_1$. Consider
the linear map $\varphi : \mathbb{C}^m \to \mathbb{C}^n$ defined by $\varphi(y) = x_1$. Show (without using Matlab)
that the matrix corresponding to this map (in the canonical basis) is $A^\dagger$.


Matrix Norms, Sequences, and Series


3.1 Matrix Norms and Subordinate Norms 


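We recall the definition of a norm on the vector space $\mathbb{K}^n$ (with $\mathbb{K} = \mathbb{R}$ or $\mathbb{C}$).


Definition 3.1.1. We call a mapping denoted by $\|\cdot\|$, from $\mathbb{K}^n$ into $\mathbb{R}^+$
satisfying the following properties a norm on $\mathbb{K}^n$:

1. $\forall x \in \mathbb{K}^n$, $\|x\| = 0 \Longrightarrow x = 0$;
2. $\forall x \in \mathbb{K}^n$, $\forall \lambda \in \mathbb{K}$, $\|\lambda x\| = |\lambda|\,\|x\|$;
3. $\forall x \in \mathbb{K}^n$, $\forall y \in \mathbb{K}^n$, $\|x + y\| \le \|x\| + \|y\|$.


If $\mathbb{K}^n$ is endowed with a scalar (or Hermitian) product $\langle\cdot,\cdot\rangle$, then the mapping
$x \mapsto \langle x, x\rangle^{1/2}$ defines a norm on $\mathbb{K}^n$. The converse is not true in general: we
cannot deduce from any norm a scalar product. The most common norms on
$\mathbb{K}^n$ are ($x_i$ denotes the coordinates of a vector x in the canonical basis of $\mathbb{K}^n$):

• the Euclidean norm, $\|x\|_2 = \left(\sum_{i=1}^n |x_i|^2\right)^{1/2}$;

• the $\ell^p$-norm, $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$ for $p \ge 1$;

• the $\ell^\infty$-norm, $\|x\|_\infty = \max_{1\le i\le n} |x_i|$.

Figure 3.1 shows the unit circles of $\mathbb{R}^2$ defined by each of these norms $\ell^1$, $\ell^2$,
and $\ell^\infty$. We recall the following important result.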


Theorem 3.1.1. If E is a vector space of finite dimension, then all norms
are equivalent on E. That is, for all pairs of norms $\|\cdot\|$, $\|\cdot\|'$ there exist two
constants c and C such that $0 < c \le C$, and for all $x \in E$, we have

$$c\,\|x\| \le \|x\|' \le C\,\|x\|.$$

Let us recall the equivalence constants between some vector norms $\ell^p$ (here
$x \in \mathbb{K}^n$ and $p \ge 1$ is an integer):

$$\begin{aligned} \|x\|_\infty &\le \|x\|_p \le n^{1/p}\,\|x\|_\infty, \\ \|x\|_2 &\le \|x\|_1 \le \sqrt{n}\,\|x\|_2. \end{aligned} \qquad (3.1)$$

Fig. 3.1. Unit circles of $\mathbb{R}^2$ for the norms $\ell^1$, $\ell^2$, and $\ell^\infty$.
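The equivalence constants between the vector norms ℓ^p, ℓ^∞, ℓ^1, and ℓ^2 recalled above can be sanity-checked on any sample vector. The following sketch does so for one arbitrarily chosen vector of dimension 4; it illustrates the inequalities but of course proves nothing. The function names are illustrative.

```python
import math

def norm_p(x, p):
    """l^p norm for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def norm_inf(x):
    return max(abs(xi) for xi in x)

x = [3.0, -4.0, 1.0, 0.5]
n = len(x)

checks = []
for p in (1, 2, 3, 7):
    # ||x||_inf <= ||x||_p <= n^(1/p) ||x||_inf
    checks.append(norm_inf(x) <= norm_p(x, p) <= n ** (1.0 / p) * norm_inf(x))
# ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2
checks.append(norm_p(x, 2) <= norm_p(x, 1) <= math.sqrt(n) * norm_p(x, 2))
```

As p grows, the ℓ^p norm of this vector decreases toward its ℓ^∞ norm, and the constant n^{1/p} tends to 1 accordingly.
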


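In this section, we confine ourselves to the case of square matrices. Since the
space $\mathcal{M}_n(\mathbb{K})$ of square matrices of size n with entries in $\mathbb{K}$ is a vector space
on $\mathbb{K}$, isomorphic to $\mathbb{K}^{n^2}$, we can define the following norms:

• Frobenius (or Schur or Euclidean) norm, $\|A\|_F = \left(\sum_{i=1}^n \sum_{j=1}^n |a_{i,j}|^2\right)^{1/2}$;

• $\ell^q$ norm, $\|A\| = \left(\sum_{i=1}^n \sum_{j=1}^n |a_{i,j}|^q\right)^{1/q}$ for $q \ge 1$;

• $\ell^\infty$ norm, $\|A\| = \max_{1\le i,j\le n} |a_{i,j}|$.


There are particular norms on $\mathcal{M}_n(\mathbb{K})$ that satisfy an additional inequality
on the product of two matrices.


Definition 3.1.2. A norm $\|\cdot\|$ defined on $\mathcal{M}_n(\mathbb{K})$ is a matrix norm if for all
matrices $A, B \in \mathcal{M}_n(\mathbb{K})$,

$$\|AB\| \le \|A\|\,\|B\|.$$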


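Example 3.1.1. The Frobenius norm is a matrix norm. Indeed, for any matrices
A and B the Cauchy-Schwarz inequality yields

$$\sum_{i=1}^n \sum_{j=1}^n \Big| \sum_{k=1}^n a_{i,k} b_{k,j} \Big|^2 \le \Big( \sum_{i=1}^n \sum_{k=1}^n |a_{i,k}|^2 \Big) \Big( \sum_{j=1}^n \sum_{k=1}^n |b_{k,j}|^2 \Big),$$

which is the desired inequality.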


Example 3.1.2. Not all norms defined on $\mathcal{M}_n(\mathbb{K})$ are matrix norms. For instance,
the norm defined by

$$\|A\| = \max_{1\le i,j\le n} |a_{i,j}| \qquad (3.2)$$

is not a matrix norm when $n > 1$, because for the matrix A whose entries are
all equal to 1, we have

$$1 = \|A\|^2 < \|A^2\| = n.$$
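This counterexample takes two lines to reproduce. The sketch below, with n = 3 chosen for the illustration and helper names that are not from the text, evaluates the largest-entry norm on the all-ones matrix and on its square.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def max_entry_norm(M):
    """Largest absolute value of an entry, i.e. the norm under discussion."""
    return max(abs(v) for row in M for v in row)

n = 3
A = [[1] * n for _ in range(n)]       # all entries equal to 1
A2 = matmul(A, A)                     # every entry of A^2 equals n
lhs = max_entry_norm(A2)              # norm of A^2
rhs = max_entry_norm(A) ** 2          # (norm of A)^2
submultiplicative_here = lhs <= rhs   # False for n > 1
```
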


We shall consider norms on $\mathcal{M}_n(\mathbb{K})$ that are yet more particular and that
are said to be subordinate to a vector norm on $\mathbb{K}^n$.


Definition 3.1.3. Let $\|\cdot\|$ be a vector norm on $\mathbb{K}^n$. It induces a matrix norm
defined by

$$\|A\| = \sup_{x \in \mathbb{K}^n,\, x \neq 0} \frac{\|Ax\|}{\|x\|}, \qquad (3.3)$$

which is said to be subordinate to this vector norm.


The reader can easily check that indeed, (3.3) is a matrix norm on $\mathcal{M}_n(\mathbb{K})$.
For convenience, the vector and matrix norms are denoted in the same way.


Remark 3.1.1. Let us recall briefly the difference between a maximum and a
supremum. The supremum, $\sup_{i\in I} x_i$, of a family $(x_i)_{i\in I}$ of real numbers is
the smallest constant C that bounds from above all the $x_i$:

$$\sup_{i\in I} x_i = \min \mathcal{C} \quad\text{with}\quad \mathcal{C} = \{C \in \mathbb{R} \text{ such that } x_i \le C,\ \forall i \in I\}.$$

(This smallest constant exists, possibly equal to $+\infty$, since $\mathcal{C}$ is a closed set
of $\mathbb{R}$ as the intersection of the closed sets $\{C \in \mathbb{R} \text{ such that } C \ge x_i\}$.) If there
exists $x_{i_0}$ such that $\sup_{i\in I} x_i = x_{i_0}$, then the supremum is said to be attained
and, by convention, we denote it by $\max_{i\in I} x_i$. Of course, this last notation,
which specifies to the reader that the supremum is attained, coincides with
the usual maximal value for a finite family $(x_i)_{i\in I}$.


Proposition 3.1.1. Let $\|\cdot\|$ be a subordinate matrix norm on $\mathcal{M}_n(\mathbb{K})$.


1. For all matrices A, the norm $\|A\|$ is also defined by

$$\|A\| = \sup_{x \in \mathbb{K}^n,\, \|x\| = 1} \|Ax\| = \sup_{x \in \mathbb{K}^n,\, \|x\| \le 1} \|Ax\|.$$

2. There exists $x_A \in \mathbb{K}^n$, $x_A \neq 0$, such that

$$\|A\| = \frac{\|Ax_A\|}{\|x_A\|},$$

and sup can be replaced by max in the definitions of $\|A\|$.
3. The identity matrix satisfies

$$\|I_n\| = 1. \qquad (3.4)$$

4. A subordinate norm is indeed a matrix norm: for all matrices A and B,
we have

$$\|AB\| \le \|A\|\,\|B\|.$$


Proof. The first point is obvious. The second point is proved by observing
that the function $x \mapsto \|Ax\|$ is continuous on the bounded, closed, and
therefore compact set $\{x \in \mathbb{K}^n,\ \|x\| = 1\}$. Thus it attains its maximum. The
third point is obvious, whereas the fourth is a consequence of the inequality
$\|ABx\| \le \|A\|\,\|Bx\|$.


Remark 3.1.2. There are matrix norms that are not subordinate to any vector
norm. A well-known example is the Frobenius norm, for which the norm of
the identity matrix is $\|I_n\|_F = \sqrt{n}$. This is not possible for a subordinate
norm according to (3.4).


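Note that the equivalences (3.1) between vector norms on $\mathbb{K}^n$ imply the
corresponding equivalences between the subordinate matrix norms. Namely,
for any matrix $A \in \mathcal{M}_n(\mathbb{K})$,

\[
\begin{aligned}
n^{-1/p}\,\|A\|_\infty &\le \|A\|_p \le n^{1/p}\,\|A\|_\infty, \\
n^{-1/2}\,\|A\|_2 &\le \|A\|_1 \le n^{1/2}\,\|A\|_2.
\end{aligned}
\tag{3.5}
\]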


The computation of matrix norms by Definition 3.1.3 may be quite difficult.
However, the usual norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$ can be computed explicitly.

Proposition 3.1.2. We consider matrices in $\mathcal{M}_n(\mathbb{K})$.

1. The matrix norm $\|A\|_1$, subordinate to the $\ell^1$-norm on $\mathbb{K}^n$, satisfies
\[
\|A\|_1 = \max_{1\le j\le n} \Big( \sum_{i=1}^n |a_{i,j}| \Big).
\]
2. The matrix norm $\|A\|_\infty$, subordinate to the $\ell^\infty$-norm on $\mathbb{K}^n$, satisfies
\[
\|A\|_\infty = \max_{1\le i\le n} \Big( \sum_{j=1}^n |a_{i,j}| \Big).
\]


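Proof. We write

\[
\|Ax\|_1 = \sum_{i=1}^n \Big| \sum_{j=1}^n a_{i,j} x_j \Big| \le \sum_{j=1}^n |x_j| \sum_{i=1}^n |a_{i,j}| \le \|x\|_1 \Big( \max_{1\le j\le n} \sum_{i=1}^n |a_{i,j}| \Big),
\]

from which we deduce the inequality

\[
\|A\|_1 \le \max_{1\le j\le n} \sum_{i=1}^n |a_{i,j}|. \tag{3.6}
\]

Let $j_0$ be the index satisfying

\[
\max_{1\le j\le n} \sum_{i=1}^n |a_{i,j}| = \sum_{i=1}^n |a_{i,j_0}|.
\]

Let $x^0$ be defined by $x^0_j = 0$ if $j \ne j_0$, and $x^0_{j_0} = 1$. We have

\[
\|x^0\|_1 = 1 \quad\text{and}\quad \|Ax^0\|_1 = \max_{1\le j\le n} \sum_{i=1}^n |a_{i,j}|,
\]

which implies that inequality (3.6) is actually an equality. Next, we write

\[
\|Ax\|_\infty = \max_{1\le i\le n} \Big| \sum_{j=1}^n a_{i,j} x_j \Big| \le \|x\|_\infty \Big( \max_{1\le i\le n} \sum_{j=1}^n |a_{i,j}| \Big),
\]

from which we infer the inequality

\[
\|A\|_\infty \le \max_{1\le i\le n} \sum_{j=1}^n |a_{i,j}|. \tag{3.7}
\]

Let $i_0$ be the index satisfying

\[
\max_{1\le i\le n} \sum_{j=1}^n |a_{i,j}| = \sum_{j=1}^n |a_{i_0,j}|.
\]

Let $x^0$ be defined by $x^0_j = 0$ if $a_{i_0,j} = 0$, and $x^0_j = \overline{a_{i_0,j}} / |a_{i_0,j}|$ if $a_{i_0,j} \ne 0$. If $A \ne 0$,
then $x^0 \ne 0$, and therefore $\|x^0\|_\infty = 1$ (if $A = 0$, then there is nothing to
prove). Furthermore,

\[
\|Ax^0\|_\infty \ge \Big| \sum_{j=1}^n a_{i_0,j} x^0_j \Big| = \sum_{j=1}^n |a_{i_0,j}| = \max_{1\le i\le n} \sum_{j=1}^n |a_{i,j}|,
\]

which proves that inequality (3.7) is actually an equality.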
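
The two formulas of Proposition 3.1.2 and the extremal vectors used in the proof can be checked numerically. The following is a minimal sketch in plain Python, restricted to a real matrix; the matrix `A` and the helper names `matvec`, `norm_1`, and `norm_inf` are illustrative assumptions, not part of the text.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def norm_1(A):
    """Largest absolute column sum (Proposition 3.1.2, item 1)."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def norm_inf(A):
    """Largest absolute row sum (Proposition 3.1.2, item 2)."""
    return max(sum(abs(a) for a in row) for row in A)

# Illustrative real matrix (an assumption, not taken from the text).
A = [[1.0, -2.0, 3.0],
     [4.0, 0.0, -1.0],
     [-2.0, 5.0, 2.0]]
n = len(A)

# 1-norm: the canonical basis vector e_{j0} attains the bound (3.6).
col_sums = [sum(abs(row[j]) for row in A) for j in range(n)]
j0 = col_sums.index(max(col_sums))
x1 = [1.0 if j == j0 else 0.0 for j in range(n)]
attained_1 = sum(abs(v) for v in matvec(A, x1))        # ||A x1||_1, with ||x1||_1 = 1

# Infinity-norm: the sign pattern of row i0 attains the bound (3.7)
# (for real entries, conj(a)/|a| is just the sign of a).
row_sums = [sum(abs(a) for a in row) for row in A]
i0 = row_sums.index(max(row_sums))
x_inf = [0.0 if a == 0 else a / abs(a) for a in A[i0]]
attained_inf = max(abs(v) for v in matvec(A, x_inf))   # ||A x_inf||_inf, with ||x_inf||_inf = 1

print(norm_1(A), attained_1)
print(norm_inf(A), attained_inf)
```

In both cases the value reached by the test vector coincides with the column-sum or row-sum formula, which is exactly the equality case of (3.6) and (3.7).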


We now proceed to the matrix norm subordinate to the Euclidean norm: 


Proposition 3.1.3. Let $\|A\|_2$ be the matrix norm subordinate to the Euclidean
norm on $\mathbb{K}^n$. We have

\[
\|A\|_2 = \|A^*\|_2 = \text{largest singular value of } A.
\]

Proof. First of all, we have

\[
\|A\|_2^2 = \sup_{x\in\mathbb{K}^n,\ x\ne 0} \frac{\|Ax\|_2^2}{\|x\|_2^2} = \sup_{x\in\mathbb{K}^n,\ x\ne 0} \frac{\langle A^*Ax, x\rangle}{\langle x, x\rangle}.
\]

By Lemma 2.7.1, $A^*A$ is self-adjoint and positive. Hence it is diagonalizable
and its eigenvalues $(\lambda_i(A^*A))_{1\le i\le n}$ are nonnegative real numbers. In the
orthonormal basis of its eigenvectors, we easily check that

\[
\sup_{x\in\mathbb{K}^n,\ x\ne 0} \frac{\langle A^*Ax, x\rangle}{\langle x, x\rangle} = \max_{1\le i\le n} \lambda_i(A^*A).
\]

Since the singular values of $A$ are the positive square roots of the eigenvalues
of $A^*A$, we infer the desired result. Moreover, the Cauchy-Schwarz inequality
yields

\[
\frac{\langle A^*Ax, x\rangle}{\langle x, x\rangle} \le \frac{\|A^*Ax\|_2\,\|x\|_2}{\langle x, x\rangle} \le \frac{\|A^*A\|_2\,\|x\|_2^2}{\langle x, x\rangle} \le \|A^*\|_2\,\|A\|_2.
\]

Taking the supremum over $x$ gives $\|A\|_2^2 \le \|A^*\|_2\,\|A\|_2$, and we deduce that
$\|A\|_2 \le \|A^*\|_2$. Applying this inequality to $A^*$, we obtain the
reverse inequality, which is to say $\|A\|_2 = \|A^*\|_2$.


Remark 3.1.3. A real matrix may be seen either as a matrix of $\mathcal{M}_n(\mathbb{R})$ or as
a matrix of $\mathcal{M}_n(\mathbb{C})$, since $\mathbb{R} \subset \mathbb{C}$. If $\|\cdot\|_{\mathbb{C}}$ is a vector norm in $\mathbb{C}^n$, we can
define its restriction $\|\cdot\|_{\mathbb{R}}$ to $\mathbb{R}^n$, which is also a vector norm in $\mathbb{R}^n$. For a real
matrix $A \in \mathcal{M}_n(\mathbb{R})$, we can thus define two subordinate matrix norms $\|A\|_{\mathbb{C}}$
and $\|A\|_{\mathbb{R}}$ by

\[
\|A\|_{\mathbb{C}} = \sup_{x\in\mathbb{C}^n,\ x\ne 0} \frac{\|Ax\|_{\mathbb{C}}}{\|x\|_{\mathbb{C}}}
\quad\text{and}\quad
\|A\|_{\mathbb{R}} = \sup_{x\in\mathbb{R}^n,\ x\ne 0} \frac{\|Ax\|_{\mathbb{R}}}{\|x\|_{\mathbb{R}}}.
\]

At first glance, these two definitions seem to be distinct. Thanks to the explicit
formulas of Proposition 3.1.2, we know that they coincide for the norms $\|x\|_1$,
$\|x\|_2$, or $\|x\|_\infty$. However, for other vector norms we may have $\|A\|_{\mathbb{C}} > \|A\|_{\mathbb{R}}$.
Since some fundamental results, like the Schur factorization theorem (Theorem
2.4.1), hold only for complex matrices, we shall assume henceforth that
all subordinate matrix norms are valued in $\mathbb{C}^n$, even for real matrices (this is
essential, in particular, for Proposition 3.1.4).


Remark 3.1.4. The spectral radius $\varrho(A)$ is not a norm on $\mathcal{M}_n(\mathbb{C})$. Indeed, we
may have $\varrho(A) = 0$ with $A \ne 0$, for instance,

\[
A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}.
\]

Nonetheless, the lemma below shows that $\varrho(A)$ is a norm on the set of normal
matrices.

Lemma 3.1.1. Let $U$ be a unitary matrix ($U^* = U^{-1}$). We have

\[
\|UA\|_2 = \|AU\|_2 = \|A\|_2.
\]

Consequently, if $A$ is a normal matrix, then $\|A\|_2 = \varrho(A)$.

Proof. Since $U^*U = I$, we have

\[
\|UA\|_2^2 = \sup_{x\in\mathbb{C}^n,\ x\ne 0} \frac{\|UAx\|_2^2}{\|x\|_2^2} = \sup_{x\in\mathbb{C}^n,\ x\ne 0} \frac{\langle U^*UAx, Ax\rangle}{\langle x, x\rangle} = \|A\|_2^2.
\]

Moreover, the change of variable $y = Ux$ satisfies $\|x\|_2 = \|y\|_2$, hence

\[
\|AU\|_2^2 = \sup_{x\in\mathbb{C}^n,\ x\ne 0} \frac{\|AUx\|_2^2}{\|x\|_2^2} = \sup_{y\in\mathbb{C}^n,\ y\ne 0} \frac{\|Ay\|_2^2}{\|U^{-1}y\|_2^2} = \sup_{y\in\mathbb{C}^n,\ y\ne 0} \frac{\|Ay\|_2^2}{\|y\|_2^2} = \|A\|_2^2.
\]

If $A$ is normal, it is diagonalizable in an orthonormal basis of eigenvectors,
$A = U\,\mathrm{diag}(\lambda_1, \dots, \lambda_n)\,U^*$, and we have $\|A\|_2 = \|\mathrm{diag}(\lambda_i)\|_2 = \varrho(A)$.
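
For a real 2 x 2 matrix, the characterization of $\|A\|_2$ as the square root of the largest eigenvalue of $A^*A$ can be evaluated in closed form. The sketch below is only an illustration under that restriction (real entries, so $A^* = A^t$); the matrices `A` and `S` and all function names are assumptions chosen for the example. It checks $\|A\|_2 = \|A^t\|_2$ and, for a symmetric (hence normal) matrix, $\|S\|_2 = \varrho(S)$.

```python
import math

def gram(A):
    """Return A^T A for a real 2x2 matrix A given as [[a, b], [c, d]]."""
    (a, b), (c, d) = A
    return [[a * a + c * c, a * b + c * d],
            [a * b + c * d, b * b + d * d]]

def eig_sym2(S):
    """Both eigenvalues (largest first) of a real symmetric 2x2 matrix [[p, q], [q, r]]."""
    p, q, r = S[0][0], S[0][1], S[1][1]
    disc = math.sqrt((p - r) ** 2 + 4.0 * q * q)
    return 0.5 * (p + r + disc), 0.5 * (p + r - disc)

def norm_2(A):
    """Square root of the largest eigenvalue of A^T A (Proposition 3.1.3, real case)."""
    return math.sqrt(eig_sym2(gram(A))[0])

def transpose(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

# A non-normal example: the 2-norms of A and of its transpose agree.
A = [[3.0, 0.0], [4.0, 5.0]]
norm_A = norm_2(A)
norm_At = norm_2(transpose(A))

# A symmetric (hence normal) example: the 2-norm equals the spectral radius.
S = [[1.0, 2.0], [2.0, 1.0]]
rho_S = max(abs(lam) for lam in eig_sym2(S))
norm_S = norm_2(S)

print(norm_A, norm_At)
print(norm_S, rho_S)
```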


Lemma 3.1.1 shows that $\varrho(A)$ and $\|A\|_2$ are equal for normal matrices.
Actually, any matrix norm $\|A\|$ is larger than $\varrho(A)$, which is, in turn, always
close to some subordinate matrix norm, as shown by the next proposition.

Proposition 3.1.4. Let $\|\cdot\|$ be a matrix norm defined on $\mathcal{M}_n(\mathbb{C})$. It satisfies

\[
\varrho(A) \le \|A\|.
\]

Conversely, for any matrix $A$ and for any real number $\varepsilon > 0$, there exists a
subordinate norm $\|\cdot\|$ (which depends on $A$ and $\varepsilon$) such that

\[
\|A\| \le \varrho(A) + \varepsilon.
\]


Proof. Let $\lambda \in \mathbb{C}$ be an eigenvalue of $A$ such that $\varrho(A) = |\lambda|$, and $x \ne 0$ a
corresponding eigenvector. If the norm $\|\cdot\|$ is subordinate to a vector norm,
we write $\|\lambda x\| = \varrho(A)\|x\| = \|Ax\| \le \|A\|\,\|x\|$ and therefore $\varrho(A) \le \|A\|$. If
$\|\cdot\|$ is some matrix norm (not necessarily subordinate), we denote by $y \in \mathbb{C}^n$ a
nonzero vector, so the matrix $xy^*$ is nonzero, and we have $\lambda xy^* = Axy^*$. Then,
taking the norm of this last equality yields $|\lambda|\,\|xy^*\| = \|Axy^*\| \le \|A\|\,\|xy^*\|$,
which implies $\varrho(A) \le \|A\|$.

To prove the second inequality, we use the Schur factorization theorem
(Theorem 2.4.1), which states that for any $A$, there exists a unitary matrix $U$
such that $T = U^{-1}AU$ is triangular:

\[
T = \begin{pmatrix}
t_{1,1} & t_{1,2} & \cdots & t_{1,n} \\
0 & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & t_{n,n}
\end{pmatrix},
\]

and the diagonal entries $t_{i,i}$ are the eigenvalues of $A$. For any $\delta > 0$ we
introduce the diagonal matrix $D_\delta = \mathrm{diag}(1, \delta, \delta^2, \dots, \delta^{n-1})$, and we define a
matrix $T_\delta = (UD_\delta)^{-1}A(UD_\delta) = D_\delta^{-1}TD_\delta$ that satisfies

\[
T_\delta = \begin{pmatrix}
t_{1,1} & \delta t_{1,2} & \cdots & \delta^{n-1} t_{1,n} \\
0 & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \delta t_{n-1,n} \\
0 & \cdots & 0 & t_{n,n}
\end{pmatrix}.
\]

Given $\varepsilon > 0$, we can choose $\delta$ sufficiently small that the off-diagonal entries of
$T_\delta$ are also very small. Namely, they satisfy for all $1 \le i \le n-1$,

\[
\sum_{j=i+1}^{n} \delta^{j-i}\,|t_{i,j}| \le \varepsilon.
\]

Since the $t_{i,i}$ are the eigenvalues of $T_\delta$, which is similar to $A$, we infer that
$\|T_\delta\|_\infty \le \varrho(A) + \varepsilon$. Then, the mapping $B \mapsto \|B\| = \|(UD_\delta)^{-1}B(UD_\delta)\|_\infty$ is
a subordinate norm (that depends on $A$ and $\varepsilon$) satisfying

\[
\|A\| \le \varrho(A) + \varepsilon,
\]

thereby yielding the result.
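
The scaling argument of the proof can be observed on a matrix that is already upper triangular (so that $U = I$ and $T = A$). The sketch below is an illustration only: the matrix `T`, the tolerance `eps`, and the halving strategy for `delta` are assumptions made for the example, not prescriptions from the text.

```python
def scaled_triangular(T, delta):
    """Return T_delta = D^{-1} T D with D = diag(1, delta, ..., delta^(n-1)).

    Entry (i, j) of T is multiplied by delta**(j - i); the zero entries
    below the diagonal stay zero.
    """
    n = len(T)
    return [[T[i][j] * delta ** (j - i) for j in range(n)] for i in range(n)]

def norm_inf(M):
    """Subordinate infinity-norm: largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

# Upper triangular example with large off-diagonal entries (assumed data).
T = [[0.5, 10.0, -7.0],
     [0.0, -0.4, 20.0],
     [0.0, 0.0, 0.3]]
n = len(T)
rho = max(abs(T[i][i]) for i in range(n))   # spectral radius = largest |diagonal entry|

eps = 1e-2
delta = 1.0
while True:
    T_delta = scaled_triangular(T, delta)
    # Largest off-diagonal row sum: sum over j > i of delta^(j-i) |t_ij|.
    off = max(sum(abs(T_delta[i][j]) for j in range(i + 1, n)) for i in range(n - 1))
    if off <= eps:
        break
    delta /= 2.0

print(norm_inf(T), rho)            # the plain infinity-norm is far above rho
print(delta, norm_inf(T_delta))    # after scaling, the norm is within eps of rho
```

The quantity `norm_inf(T_delta)` is the value at `T` of the norm $B \mapsto \|D_\delta^{-1} B D_\delta\|_\infty$, which is the subordinate norm constructed in the proof when $U = I$.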


Remark 3.1.5. The first part of Proposition 3.1.4 (the inequality $\varrho(A) \le \|A\|$) may be
false if the norm is not a matrix norm. For instance, the norm defined by (3.2) is not a matrix
norm, and we have the counterexample

\[
\varrho(A) = 2 > \|A\| = 1 \quad\text{for}\quad A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.
\]

In Exercise 3.8, we dwell on the link between a matrix norm and the spectral
radius by showing that

\[
\varrho(A) = \lim_{k\to+\infty} \big( \|A^k\| \big)^{1/k}.
\]

Remark 3.1.6. Propositions 3.1.2 and 3.1.4 provide an immediate upper bound
for the spectral radius of a matrix:

\[
\varrho(A) \le \min\Big( \max_{1\le j\le n} \sum_{i=1}^n |a_{i,j}|,\ \max_{1\le i\le n} \sum_{j=1}^n |a_{i,j}| \Big).
\]

3.2 Subordinate Norms for Rectangular Matrices 


Similar norms can be defined on the space $\mathcal{M}_{m,n}(\mathbb{K})$ of rectangular (or nonsquare)
matrices of size $m \times n$ with entries in $\mathbb{K}$. For instance,

• the Frobenius (or Schur, or Euclidean) norm
\[
\|A\|_F = \Big( \sum_{i=1}^m \sum_{j=1}^n |a_{i,j}|^2 \Big)^{1/2};
\]
• the $\ell^q$-norm, $\|A\| = \big( \sum_{i=1}^m \sum_{j=1}^n |a_{i,j}|^q \big)^{1/q}$ for $q \ge 1$;
• the $\ell^\infty$-norm, $\|A\| = \max_{1\le i\le m,\ 1\le j\le n} |a_{i,j}|$.

We may, of course, define a subordinate matrix norm in $\mathcal{M}_{m,n}(\mathbb{K})$ by

\[
\|A\| = \sup_{x\in\mathbb{K}^n,\ x\ne 0} \frac{\|Ax\|_m}{\|x\|_n},
\]

where $\|\cdot\|_n$ (respectively $\|\cdot\|_m$) is a vector norm on $\mathbb{K}^n$ (respectively $\mathbb{K}^m$).


We conclude this section by defining the best approximation of a given (not 
necessarily square) matrix by matrices of fixed rank. Recall that this property 
is important for the example of image compression described in Section 1.5. 


Proposition 3.2.1. Let $A = V\tilde{\Sigma}U^*$ be the SVD factorization of some matrix
$A \in \mathcal{M}_{m,n}(\mathbb{C})$ having $r$ nonzero singular values arranged in decreasing order.
For $1 \le k < r$, the matrix $A_k = \sum_{i=1}^k \mu_i v_i u_i^*$ is the best approximation of $A$
by matrices of rank $k$, in the following sense: for all matrices $X \in \mathcal{M}_{m,n}(\mathbb{C})$
of rank $k$, we have

\[
\|A - A_k\|_2 \le \|A - X\|_2. \tag{3.8}
\]

Moreover, the error made in substituting $A$ with $A_k$ is $\|A - A_k\|_2 = \mu_{k+1}$.

Proof. According to (2.10), we have

\[
A - A_k = \sum_{i=k+1}^{r} \mu_i v_i u_i^* = [v_{k+1} | \dots | v_r]\ \mathrm{diag}(\mu_{k+1}, \dots, \mu_r) \begin{pmatrix} u_{k+1}^* \\ \vdots \\ u_r^* \end{pmatrix}.
\]

Denoting by $D \in \mathcal{M}_{m,n}(\mathbb{R})$ the matrix $\mathrm{diag}(0, \dots, 0, \mu_{k+1}, \dots, \mu_r, 0, \dots, 0)$,
we have $A - A_k = VDU^*$, and since the Euclidean norm is invariant under
unitary transformation, we have $\|A - A_k\|_2 = \|D\|_2 = \mu_{k+1}$. Let us now prove
the approximation property. For all $x \in \mathbb{C}^n$, we have

\[
\|Ax\|_2 = \|V\tilde{\Sigma}U^*x\|_2 = \|\tilde{\Sigma}U^*x\|_2. \tag{3.9}
\]

Let $E$ be the subspace of $\mathbb{C}^n$, of dimension $k+1$, generated by the vectors
$u_1, \dots, u_{k+1}$. If $x \in E$, we have $x = \sum_{i=1}^{k+1} x_i u_i$ and

\[
U^*x = U^* \sum_{i=1}^{k+1} x_i u_i = \sum_{i=1}^{k+1} x_i U^*u_i = \sum_{i=1}^{k+1} x_i e_i,
\]

where $e_i$ is the $i$th vector of the canonical basis of $\mathbb{C}^n$. Thus we have

\[
\tilde{\Sigma}U^*x = (\mu_1 x_1, \dots, \mu_{k+1} x_{k+1}, 0, \dots, 0)^t.
\]

So by (3.9) and the decreasing order of the singular values $\mu_i$,

\[
\|Ax\|_2 \ge \mu_{k+1}\|x\|_2, \quad \forall x \in E. \tag{3.10}
\]

If the matrix $X \in \mathcal{M}_{m,n}(\mathbb{C})$ is of rank $k < r$, its kernel is of dimension
$n - k \ge r - k \ge 1$, and for all $x \in \mathrm{Ker}(X)$, we have

\[
\|Ax\|_2 = \|(A - X)x\|_2 \le \|A - X\|_2\,\|x\|_2.
\]

Assume that $X$ contradicts (3.8):

\[
\|A - X\|_2 < \|A - A_k\|_2.
\]

Hence for all nonzero $x \in \mathrm{Ker}(X)$,

\[
\|Ax\|_2 < \|A - A_k\|_2\,\|x\|_2 = \mu_{k+1}\|x\|_2,
\]

and therefore if $x \in E \cap \mathrm{Ker}(X)$ with $x \ne 0$, we end up with a contradiction
to (3.10). Such an $x$ exists: the two spaces have a nontrivial intersection since
$\dim E + \dim \mathrm{Ker}(X) > n$, so that (3.8) is finally satisfied.


3.3 Matrix Sequences and Series 


In the sequel, we consider only square matrices. 


Definition 3.3.1. A sequence of matrices $(A_i)_{i\ge 1}$ converges to a limit $A$ if
for a matrix norm $\|\cdot\|$, we have

\[
\lim_{i\to+\infty} \|A_i - A\| = 0,
\]

and we write $A = \lim_{i\to+\infty} A_i$.


The definition of convergence does not depend on the chosen norm, since 
Mn(C) is a vector space of finite dimension. Therefore Theorem 3.1.1, which 
asserts that all norms are equivalent, is applicable, and thus if a sequence 
converges for one norm, it converges for all norms. 


Remark 3.3.1. Let us recall that $\mathcal{M}_n(\mathbb{C})$, having finite dimension, is a complete
space, that is, every Cauchy sequence of elements of $\mathcal{M}_n(\mathbb{C})$ is a convergent
sequence in $\mathcal{M}_n(\mathbb{C})$:

\[
\lim_{i\to+\infty} \lim_{j\to+\infty} \|A_i - A_j\| = 0 \ \Longrightarrow\ \exists A \in \mathcal{M}_n(\mathbb{C}) \text{ such that } \lim_{i\to+\infty} \|A_i - A\| = 0.
\]

A matrix series is a sequence $(S_i)_{i\ge 0}$ defined by partial sums of another sequence
of matrices $(A_i)_{i\ge 0}$: $S_i = \sum_{j=0}^{i} A_j$. A series is said to be convergent if
the sequence of partial sums is convergent. Among all series, we shall be more
particularly concerned with matrix power series defined by $(a_i A^i)_{i\ge 0}$, where
each $a_i$ is a scalar in $\mathbb{C}$ and $A^i$ is the $i$th power of the matrix $A$. A necessary
but not sufficient condition for the convergence of a power series is that the
sequence $a_i A^i$ converge to 0. The following result provides a necessary and
sufficient condition for the sequence of iterated powers of a matrix to converge
to 0.


Lemma 3.3.1. Let $A$ be a matrix in $\mathcal{M}_n(\mathbb{C})$. The following four conditions
are equivalent:

1. $\lim_{i\to+\infty} A^i = 0$;
2. $\lim_{i\to+\infty} A^i x = 0$ for all vectors $x \in \mathbb{C}^n$;
3. $\varrho(A) < 1$;
4. there exists at least one subordinate matrix norm such that $\|A\| < 1$.


Proof. Let us first show that (1) $\Rightarrow$ (2). The inequality

\[
\|A^i x\| \le \|A^i\|\,\|x\|
\]

implies $\lim_{i\to+\infty} A^i x = 0$. Next, (2) $\Rightarrow$ (3); otherwise, there would exist $\lambda$
and $x \ne 0$ satisfying $Ax = \lambda x$ and $|\lambda| = \varrho(A) \ge 1$, which would entail that the
sequence $A^i x = \lambda^i x$ cannot converge to 0. Since (3) $\Rightarrow$ (4) is an immediate
consequence of Proposition 3.1.4, it remains only to show that (4) $\Rightarrow$ (1). To
this end, we consider the subordinate matrix norm such that $\|A\| < 1$, and
accordingly,

\[
\|A^i\| \le \|A\|^i \to 0 \quad\text{when } i \to +\infty,
\]

which proves that $A^i$ tends to 0.
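
Lemma 3.3.1 says that the powers of $A$ tend to 0 as soon as $\varrho(A) < 1$, even when a particular norm of $A$ exceeds 1. A small experiment illustrates this; the triangular matrix below (eigenvalues 0.5 and 0.5, but $\|A\|_\infty = 2.5$) and the number of iterations are assumptions chosen for the example.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def norm_inf(M):
    """Subordinate infinity-norm: largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

# Upper triangular, so the eigenvalues are the diagonal entries: rho(A) = 0.5 < 1.
A = [[0.5, 2.0],
     [0.0, 0.5]]

norms = []                       # norms[i-1] holds ||A^i||_inf
P = [[1.0, 0.0], [0.0, 1.0]]
for i in range(1, 61):
    P = matmul(P, A)
    norms.append(norm_inf(P))

print(norms[0], norms[1], norms[2])   # starts above 1 ...
print(norms[-1])                      # ... but eventually tends to 0
```

Here $A^i = 0.5^i I + i\,0.5^{i-1} N$ with $N$ the strictly upper triangular part of $A$, so $\|A^i\|_\infty = 0.5^i(1 + 4i)$: the norm is larger than 1 for the first few powers and then decays geometrically.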


We now study some properties of matrix power series. 


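Theorem 3.3.1. Consider a power series on $\mathbb{C}$ of positive radius of convergence $R$:

\[
\Big| \sum_{i=0}^{+\infty} a_i z^i \Big| < +\infty, \quad \forall z \in \mathbb{C} \text{ such that } |z| < R.
\]

For any matrix $A \in \mathcal{M}_n(\mathbb{C})$ such that $\varrho(A) < R$, the series $(a_i A^i)_{i\ge 0}$ is
convergent, i.e., $\sum_{i=0}^{+\infty} a_i A^i$ is well defined in $\mathcal{M}_n(\mathbb{C})$.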


Proof. Since $\varrho(A) < R$, thanks to Lemma 3.3.1, there exists a subordinate
matrix norm for which we also have $\|A\| < R$. We check the Cauchy criterion
for the sequence of partial sums:

\[
\Big\| \sum_{k=j+1}^{i} a_k A^k \Big\| \le \sum_{k=j+1}^{i} |a_k|\,\|A\|^k. \tag{3.11}
\]

Now, a power series on $\mathbb{C}$ is absolutely convergent inside its disk of convergence.
Hence $\|A\| < R$ implies that the right-hand term in (3.11) tends to 0
as $j$ and $i$ tend to $+\infty$. The convergence of the series is therefore established.


The previous notion of matrix power series generalizes, of course, the def- 
inition of matrix polynomial and may be extended to analytic functions. 


Definition 3.3.2. Let $f(z) : \mathbb{C} \to \mathbb{C}$ be an analytic function defined in the
disk of radius $R > 0$ written as a power series:

\[
f(z) = \sum_{i=0}^{+\infty} a_i z^i \quad \forall z \in \mathbb{C} \text{ such that } |z| < R.
\]

By a slight abuse of notation, for any $A \in \mathcal{M}_n(\mathbb{C})$ with $\varrho(A) < R$, we define
the matrix $f(A)$ by

\[
f(A) = \sum_{i=0}^{+\infty} a_i A^i.
\]


Let us give some useful examples of matrix functions. The exponential function
is analytic in $\mathbb{C}$. Accordingly, for all matrices $A$ (without restriction on their
spectral radii), we can define its exponential by the formula

\[
e^A = \sum_{i=0}^{+\infty} \frac{A^i}{i!}.
\]
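
The exponential series $e^A = \sum_{i\ge 0} A^i/i!$ can be approximated by truncating the sum. The sketch below does exactly that and nothing more; the number of terms (30) is an arbitrary assumption that is adequate for the small test matrices used here, and production routines (such as the Matlab function expm mentioned in Exercise 3.10) rely on more robust techniques than a plain truncated series.

```python
import math

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_exp(A, terms=30):
    """Truncated exponential series: sum of A^i / i! for i = 0, ..., terms - 1."""
    n = len(A)
    total = identity(n)
    term = identity(n)                      # current term A^i / i!
    for i in range(1, terms):
        term = [[v / i for v in row] for row in matmul(term, A)]
        total = [[total[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return total

# Nilpotent example: N^2 = 0, so exp(N) = I + N exactly.
N = [[0.0, 1.0], [0.0, 0.0]]
exp_N = mat_exp(N)

# Diagonal example: exp(diag(1, 2)) = diag(e, e^2).
D = [[1.0, 0.0], [0.0, 2.0]]
exp_D = mat_exp(D)

print(exp_N)
print(exp_D, math.e, math.e ** 2)
```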


Similarly, the function $1/(1 - z)$ is analytic in the unit disk and thus equal to
$\sum_{i=0}^{+\infty} z^i$ for $z \in \mathbb{C}$ such that $|z| < 1$. We therefore deduce an expression for
$(I - A)^{-1}$.


Proposition 3.3.1. Let $A$ be a matrix with spectral radius $\varrho(A) < 1$. The
matrix $(I - A)$ is nonsingular and its inverse is given by

\[
(I - A)^{-1} = \sum_{i=0}^{+\infty} A^i.
\]

Proof. We already know that the series $(A^i)_{i\ge 0}$ is convergent. We compute

\[
(I - A) \sum_{i=0}^{p} A^i = \sum_{i=0}^{p} A^i (I - A) = I - A^{p+1}.
\]

Since $A^{p+1}$ converges to 0 as $p$ tends to infinity, we deduce that the sum of
the series $(A^i)_{i\ge 0}$ is equal to $(I - A)^{-1}$.

Remark 3.3.2. Proposition 3.3.1 shows in particular that the set of nonsingular
matrices is an open set in $\mathcal{M}_n(\mathbb{C})$. Indeed, consider a subordinate norm $\|\cdot\|$.
Given a nonsingular matrix $M$, any matrix $N$ such that $\|M - N\| < \|M^{-1}\|^{-1}$
is nonsingular as the product of two nonsingular matrices:

\[
N = M\big(I - M^{-1}(M - N)\big),
\]

since $\|M^{-1}(M - N)\| < 1$. Hence for any nonsingular matrix $M$, there exists
a neighborhood of $M$ that also consists of nonsingular matrices. This proves
that the set of nonsingular matrices is open.
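
The series of Proposition 3.3.1 (often called the Neumann series) is easy to try out numerically. The following sketch sums the powers of a small matrix with $\|A\|_\infty = 0.7 < 1$ and checks that the result is an inverse of $I - A$; the matrix and the number of terms (200) are assumptions made for the illustration.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Largest absolute row sum is 0.7, hence rho(A) <= 0.7 < 1 by Proposition 3.1.4.
A = [[0.2, 0.1],
     [0.3, 0.4]]
n = len(A)
I = identity(n)

# Partial sum S_p = I + A + ... + A^p with p = 200.
S = identity(n)
term = identity(n)
for _ in range(200):
    term = matmul(term, A)
    S = add(S, term)

# (I - A) S_p = I - A^(p+1), so the residual below should be negligible.
I_minus_A = [[I[i][j] - A[i][j] for j in range(n)] for i in range(n)]
check = matmul(I_minus_A, S)
residual = max(abs(check[i][j] - I[i][j]) for i in range(n) for j in range(n))

print(S)
print(residual)
```

For this 2 x 2 example the inverse can also be written by hand, $(I - A)^{-1} = \frac{1}{0.45}\begin{pmatrix} 0.6 & 0.1 \\ 0.3 & 0.8 \end{pmatrix}$, which gives an independent check of the partial sum.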


3.4 Exercises 


3.1. Let A be the matrix defined by A=rand(n,n), where n is a fixed integer.
Define $M = \max_{i,j} |A_{i,j}|$ (M may be computed by the Matlab function max).
Compare M with the norms $\|A\|_1$, $\|A\|_2$, $\|A\|_\infty$, and $\|A\|_F$ (use the function
norm). Justify.


3.2. Let T be a nonsingular triangular matrix defined by one of the instructions
T=LNonsingularMat(n) and T=UNonsingularMat(n) (these functions
have been defined in Exercise 2.2), where n is a fixed integer. Define
$m = (\min_i |T_{i,i}|)^{-1}$ (m may be computed by the Matlab function min). Compare
m with the norms $\|T^{-1}\|_1$, $\|T^{-1}\|_2$, $\|T^{-1}\|_\infty$, and $\|T^{-1}\|_F$.


3.3 (*). Define a diagonal matrix A by
u=rand(n,1); A=diag(u);
Compute the norm $\|A\|_p$ for $p = 1, 2, \infty$. Comment on the observed results.


3.4. For various values of n, define two vectors u=rand(n,1) and v=rand(n,1).
Compare the matrix norm $\|uv^*\|_2$ with the vector norms $\|u\|_2$ and $\|v\|_2$. Justify
your observation. Same questions for the Frobenius norm as well as the
norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$.


3.5. For different values of n, define a nonsingular square matrix A by
A=NonsingularMat(n); (see Exercise 2.2). Let P and v be the matrix and
the vector defined by

[P,D]=eig(A*A'); [d k]=min(abs(diag(D))); v=P(:,k);

Compare $\|A^{-1}v\|_2$ and $\|A^{-1}\|_2$. Justify.


3.6. Let A be a real matrix of size $m \times n$.

1. What condition should A meet for the function $x \mapsto \|Ax\|$, where $\|\cdot\|$
denotes some norm on $\mathbb{R}^m$, to be a norm on $\mathbb{R}^n$? Write a function NormA
that computes $\|Ax\|_2$ for $x \in \mathbb{R}^n$.

2. Assume A is a square matrix. Write a function NormAs that computes
$\sqrt{\langle Ax, x\rangle}$ for $x \in \mathbb{R}^n$. Does this function define a norm on $\mathbb{R}^n$?

3.7 (*). Define a matrix A by A=PdSMat(n) (see Exercise 2.20) and denote by
$\|\cdot\|_A$ the norm defined on $\mathbb{R}^n$ by $\|x\|_A = \sqrt{\langle Ax, x\rangle}$ (see the previous exercise).
Let $S_A$ be the unit sphere of $\mathbb{R}^n$ for this norm:

\[
S_A = \{x \in \mathbb{R}^n,\ \|x\|_A = 1\}.
\]

1. Prove (do not use Matlab) that $S_A$ lies between two spheres (for the
Euclidean norm) centered at the origin, and of respective radii $1/\sqrt{\lambda_{\min}}$ and
$1/\sqrt{\lambda_{\max}}$, that is,

\[
x \in S_A \ \Longrightarrow\ \frac{1}{\sqrt{\lambda_{\max}}} \le \|x\|_2 \le \frac{1}{\sqrt{\lambda_{\min}}},
\]

where $\lambda_{\min}$ (respectively $\lambda_{\max}$) denotes the smallest (respectively largest)
eigenvalue of A.

2. Plotting $S_A$ for n = 2.

(a) Let $\Gamma_p$ be the line $x_2 = p x_1$, $p \in \mathbb{R}$. Compute $\langle Ax, x\rangle$ for $x \in \Gamma_p$ (do
not use Matlab). Compute the intersection of $\Gamma_p$ and $S_A$.

(b) Write a function function [x,y]=UnitCircle(A,n) whose input arguments
are a 2 x 2 matrix A and an integer n, and that returns two
vectors x and y, of size n, containing respectively the abscissas and
the ordinates of n points of the curve $S_A$.

(c) Plot on the same graph the unit circle for the Euclidean norm and
$S_A$ for A = [7 5; 5 7]. Prove rigorously that the curve obtained is
an ellipse.

(d) Plot on the same graph the unit circles for the norms defined by the 
matrices 


7 5 6.5 5.5 (21 
a=( a) B= (53 l c=. a one 


Comment on the results. 




3.8 (*). Define a matrix A by A=rand(n,n). Compare the spectral radius
of A and $\|A^k\|_2^{1/k}$ for $k = 10, 20, 30, \dots, 100$. What do you notice? Does the
result depend on the chosen matrix norm? (Try the norms $\|\cdot\|_1$, $\|\cdot\|_\infty$,
and $\|\cdot\|_F$.)

Explanation.

1. Prove that $\varrho(A) \le \|A^k\|^{1/k}$, $\forall k \in \mathbb{N}^*$.
2. For $\varepsilon > 0$, define $A_\varepsilon = \frac{1}{\varrho(A) + \varepsilon} A$. Prove that $\varrho(A_\varepsilon) < 1$. Deduce that there
exists $k_0 \in \mathbb{N}$ such that $k \ge k_0$ implies $\varrho(A) \le \|A^k\|^{1/k} \le \varrho(A) + \varepsilon$.
3. Conclude.


3.9. Define a matrix A=MatRank(m,n,r) and let [V S U] = svd(A) be the
SVD factorization of A (for example, take m = 10, n = 7, r = 5). For
$k = 1, \dots, r-1$, we compute the approximated SVD factorization of A by the
instruction [v,s,u] = svds(A,k). We have seen in Proposition 3.2.1 that
the best approximation (in $\|\cdot\|_2$-norm) of matrix A by $m \times n$ matrices of rank
k is the matrix $A_k$ = v*s*u', and that the approximation error is $\|A - A_k\|_2 = S_{k+1,k+1}$.
We set out to compute the same error, but in the Frobenius norm.

1. For $k = r-1, r-2, \dots, 1$, display $\|A - A_k\|_F^2$ and the squares of the singular
values of A. What relation do you observe between these two quantities?
2. Justify this relation rigorously.


3.10. We revisit the spring system in Section 1.3, assuming in addition that
each mass is subjected to a damping, proportional to the velocity, with a given
coefficient $c_i > 0$; the zero right-hand sides of (1.8) have to be replaced by
$-c_i \dot{y}_i$ (damping or braking term).

1. Show that the vector $y = (y_1, y_2, y_3)^t$ is a solution of

\[
M\ddot{y} + C\dot{y} + Ky = 0, \tag{3.13}
\]

where C is a matrix to be specified.

2. Define a vector $z(t) = (y^t, \dot{y}^t)^t \in \mathbb{R}^6$. Prove that z is a solution of

\[
\dot{z}(t) = Az(t), \tag{3.14}
\]

for some matrix A to be specified. For a given initial datum z(0), i.e., for
prescribed initial positions and speeds of the three masses, (3.14) admits
a unique solution, which is precisely $z(t) = e^{At}z(0)$. (For the definition of
the matrix exponential, see Section 3.3.)

3. Assume that the stiffness constants are equal ($k_1 = k_2 = k_3 = 1$), as well
as the damping coefficients ($c_1 = c_2 = c_3 = 1/2$), and that the masses
are $m_1 = m_3 = 1$ and $m_2 = 2$. The equilibrium positions of the masses
are supposed to be $x_1 = -1$, $x_2 = 0$, and $x_3 = 1$. At the initial time,
the masses $m_1$ and $m_3$ are moved away from their equilibrium position
by $y_1(0) = -0.1$, $y_3(0) = 0.1$, with initial speeds $\dot{y}_1(0) = -1$, $\dot{y}_3(0) = 1$,
while the other mass $m_2$ is at rest, $y_2(0) = 0$, $\dot{y}_2(0) = 0$. Plot on the same
graph the time evolutions of the positions $x_i + y_i(t)$ of the three
masses. Vary the time t from 0 to 30, by a step of 1/10. Plot the speeds
on another graph. Comment.

Hint: use the Matlab function expm to compute the matrix exponential.


4


Introduction to Algorithmics 


This chapter is somewhat unusual in comparison to the other chapters of this 
course. Indeed, it contains no theorems, but rather notions that are at the 
crossroads of mathematics and computer science. However, the reader should 
note that this chapter is essential from the standpoint of applications, and for 
the understanding of the methods introduced in this course. For more details 
on the fundamental notions of algorithmics, the reader can consult [1]. 


4.1 Algorithms and pseudolanguage 


In order to fully grasp the notion of a mathematical algorithm, we shall illus- 
trate our purpose by the very simple, yet instructive, example of the multipli- 
cation of two matrices. We recall that the operation of matrix multiplication 
is defined by 

\[
\begin{aligned}
\mathcal{M}_{n,p}(\mathbb{K}) \times \mathcal{M}_{p,q}(\mathbb{K}) &\to \mathcal{M}_{n,q}(\mathbb{K}) \\
(A, B) &\mapsto C = AB,
\end{aligned}
\]

where the matrix C is defined by its entries, which are given by the simple
formula

\[
c_{i,j} = \sum_{k=1}^{p} a_{i,k} b_{k,j}, \quad 1 \le i \le n,\ 1 \le j \le q. \tag{4.1}
\]

Formula (4.1) can be interpreted in various ways as vector operations. The
most “natural” way is to see (4.1) as the scalar product of the ith row of A
with the jth column of B. Introducing $(a_i)_{1\le i\le n}$, the rows of A (with $a_i \in \mathbb{R}^p$),
and $(b_j)_{1\le j\le q}$, the columns of B (with $b_j \in \mathbb{R}^p$), we successively compute the
entries of C as

\[
c_{i,j} = a_i \cdot b_j.
\]

However, there is a “dual” way of computing the product C in which the
prominent role of the rows of A and columns of B is inverted by focusing
rather on the columns of A and the rows of B.

Let $(a^k)_{1\le k\le p}$ be the columns of A (with $a^k \in \mathbb{R}^n$), and let $(b^k)_{1\le k\le p}$ be
the rows of B (with $b^k \in \mathbb{R}^q$). We note that C is also defined by the formula

\[
C = \sum_{k=1}^{p} a^k (b^k)^t, \tag{4.2}
\]

where we recall that the tensor product of a column vector by a row vector is
defined by

\[
xy^t = (x_i y_j)_{1\le i\le n,\ 1\le j\le q} \in \mathcal{M}_{n,q}(\mathbb{R}),
\]

where the $(x_i)_{1\le i\le n}$ are the entries of $x \in \mathbb{R}^n$, and $(y_j)_{1\le j\le q}$ the entries of
$y \in \mathbb{R}^q$. Formula (4.2) is no longer based on the scalar product, but rather on
the tensor product of vectors, which numerically amounts to multiplying each
column of A by scalars that are the entries of each row of B.

Of course, in both computations of the product matrix C, the same multiplications
of scalars are performed; what differentiates them is the order of
the operations. In theory, this is irrelevant; in computing, however, these two
procedures are quite different! Depending on the way the matrices A, B, C
are stored in the computer memory, access to their rows and columns can be
more or less fast (this depends on a certain number of factors such as the
memory, the cache, and the processor, none of which we shall consider here).
In full generality, there are several ways of performing the same mathematical
operation.


Definition 4.1.1. We call the precise ordered sequence of elementary opera- 
tions for carrying out a given mathematical operation an algorithm. 


This definition calls immediately for a number of comments. 


• One has to distinguish between the mathematical operation, which is the
goal or the task to be done, and the algorithm, which is the means to that
end. In particular, there can be several different algorithms that perform
the same operation.

• Two algorithms may carry out the same elementary operations and differ
only in the order of the sequence (that is the case of the two algorithms
above for matrix multiplication). All the same, two algorithms may also
differ by the very nature of their elementary operations, while producing
the same result (see below the Strassen algorithm).

• The notion of elementary operation is necessarily fuzzy. We can agree
(as here and in the whole course) that it is an algebraic operation on
a scalar. But after all, a computer knows only binary numbers, and a
product of real numbers or the extraction of a square root already requires
algorithms! Nevertheless, we shall never go this far into details. By the
same token, if a product of block matrices has to be computed, we can
consider the multiplication of blocks as an elementary operation and not
the scalar multiplication.

The script of an algorithm is an essential step, not only for writing a computer
program out of a mathematical method, but also for the assessment
of its performance and its efficiency, that is, for counting the number of el- 
ementary operations that are necessary to its realization (see Section 4.2). 
Of course, as soon as a rigorous measure of the efficiency of an algorithm is 
available, a key issue is to find the best possible algorithm for a given mathe- 
matical operation. This is a difficult problem, which we shall barely illustrate 
by the case of the Strassen algorithm for matrix multiplication (see Section 
4.3). 

Although the above definition stipulates that an algorithm is characterized 
by an “ordered” sequence of elementary operations, we have been up to now 
relatively vague in the description of the two algorithms proposed for matrix 
multiplication. More precisely, we need a notion of language, not so much for 
writing programs, but for arranging operations. We call it pseudolanguage. 
It allows one to accurately write the algorithm without going through purely 
computational details such as the syntax rules (which vary with languages), 
the declaration of arrays, and the passing of arguments. It is easy to transcribe 
(except for these details) into a computing language dedicated to numerical 
calculations (for instance, Fortran, Pascal, C, C++). 

The reader will soon find out that the pseudolanguage is a description 
tool for algorithms that for the sake of convenience obeys no rigorous syntax. 
Even so, let us insist on the fact that although this pseudolanguage is not 
accurately defined, it complies with some basic rules: 


1. The symbol = is no longer the mathematical symbol of equality but the
computer science symbol of assignment. When we write a = b we assign
the value of b to the variable a, overwriting the previous value of a. When
we write a = a + b we add to the value of a that of b, but by no means
shall we infer that b is zero.

2. The elementary operations are performed on scalars. When vectors or ma- 
trices are handled, loops of elementary operations on their entries should 
be written. Note in passing that Matlab is able to perform elementary 
operations directly on matrices (this is actually much better, in terms of 
computing time, than writing loops on the matrix entries). 

3. At the beginning of the algorithm, the data (entries) and the results (out- 
puts) should be specified. In the course of an algorithm, intermediate 
computational variables may be used. 

4. Redundant or useless operations should be avoided for the algorithm to 
be executed in a minimum number of operations. 


As an example, we consider the two matrix multiplication algorithms that we
have just presented.

Data: A and B. Output: C = AB.

For i = 1 ↗ n
    For j = 1 ↗ q
        C_{i,j} = 0
        For k = 1 ↗ p
            C_{i,j} = C_{i,j} + A_{i,k} B_{k,j}
        End k
    End j
End i

Algorithm 4.1: Product of two matrices: “scalar product” algorithm.

Data: A and B. Output: C = AB.

For i = 1 ↗ n
    For j = 1 ↗ q
        C_{i,j} = 0
    End j
End i
For k = 1 ↗ p
    For i = 1 ↗ n
        For j = 1 ↗ q
            C_{i,j} = C_{i,j} + A_{i,k} B_{k,j}
        End j
    End i
End k

Algorithm 4.2: Product of two matrices: “tensor product” algorithm.

Let us remark that these two algorithms differ only in the order of their
loops (we have intentionally kept the same names for the indices in both
algorithms). If this makes no difference from a mathematical viewpoint, it is
not quite the same from the computer science viewpoint: the entries of A and
B are accessed either along their rows or columns, which is not executed with
the same speed depending on the way they are stored in the memory of the
computer. In both algorithms, the order of the loops in i and j can be changed.
Actually, we obtain as many algorithms as there are possible arrangements of
the three loops in i, j, k (check it as an exercise).
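
To make the two loop orderings concrete, here is a direct transcription of Algorithms 4.1 and 4.2 into Python, given as a sketch: matrices are stored as lists of rows, indices start at 0 rather than 1, and the function names `product_scalar` and `product_tensor` as well as the test matrices are assumptions introduced for the illustration.

```python
def product_scalar(A, B):
    """Algorithm 4.1, loop order (i, j, k): each C[i][j] is the scalar
    product of row i of A with column j of B."""
    n, p, q = len(A), len(B), len(B[0])
    C = [[0.0] * q for _ in range(n)]
    for i in range(n):
        for j in range(q):
            for k in range(p):
                C[i][j] = C[i][j] + A[i][k] * B[k][j]
    return C

def product_tensor(A, B):
    """Algorithm 4.2, loop order (k, i, j): C accumulates the rank-one
    products of column k of A with row k of B."""
    n, p, q = len(A), len(B), len(B[0])
    C = [[0.0] * q for _ in range(n)]      # initialization loops of Algorithm 4.2
    for k in range(p):
        for i in range(n):
            for j in range(q):
                C[i][j] = C[i][j] + A[i][k] * B[k][j]
    return C

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # 2 x 3
B = [[7.0, 8.0],
     [9.0, 10.0],
     [11.0, 12.0]]             # 3 x 2

C1 = product_scalar(A, B)
C2 = product_tensor(A, B)
print(C1)
print(C2)
```

Both functions perform the same scalar multiplications and return the same matrix; only the order in which the entries of A, B, and C are visited differs.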

Obviously, the matrix product is too simple an operation to convince the
reader of the usefulness of writing an algorithm in pseudolanguage. Nevertheless,
one should be reassured: more difficult operations will soon arrive!
Let us emphasize again the essential contributions of the script in pseudolan- 
guage: on the one hand, it provides a good understanding of the sequencing 
of the algorithm, and on the other hand, it enables one to accurately count 
the number of operations necessary to its execution. 


4.2 Operation Count and Complexity 


The performance or the cost of an algorithm is mainly appraised by the number
of operations that are required to execute it. This cost also depends on
other factors such as the necessary number of registers of memory and the
number of memory accesses to look for new data, but we neglect them in
the sequel. The fewer operations an algorithm requires, the more efficient
it is.


Definition 4.2.1. We call the number of multiplications and divisions re- 
quired to execute an algorithm its complexity. 


We neglect all other operations such as additions (much quicker than multiplications
on a computer) or square roots (much scarcer than multiplications
in general), which simplifies the counting of operations. If the algorithm
is carried out on a problem of size n (for instance, the order of the matrix or
the number of entries of a vector), we denote by $N_{op}(n)$ its complexity, or its
number of operations. The exact computation of $N_{op}(n)$ is often complex or
delicate (because of boundary effects in the writing of loops). We thus content
ourselves with finding an equivalent of $N_{op}(n)$ when the dimension n of the
problem is very large (we talk then about asymptotic complexity). In other
words, we only look for the leading term of the asymptotic expansion of $N_{op}(n)$ as n
tends to infinity.

Thanks to the transcription of the algorithm into pseudolanguage, it is easy
to count its operations. At this stage we understand why a pseudolanguage
script should avoid redundant or useless computations; otherwise, we may
obtain a bad operation count that overestimates the actual number of operations.
In both examples above (matrix product algorithms), the determination
of $N_{op}(n)$ is easy: for each i, j, k, we execute a multiplication. Consequently,
the number of operations is $npq$. If all matrices are square, $n = p = q$, we get
the classical result $N_{op}(n) \approx n^3$.
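
The operation count can be read directly off the loop structure. As a small sanity check, the sketch below runs the three nested loops of the matrix product with a counter in place of the arithmetic; the function name `count_multiplications` is an assumption introduced for the illustration.

```python
def count_multiplications(n, p, q):
    """Count the scalar multiplications performed by the triple loop of the
    matrix product of an n x p matrix by a p x q matrix."""
    count = 0
    for i in range(n):
        for j in range(q):
            for k in range(p):
                count += 1          # stands for one product A[i][k] * B[k][j]
    return count

print(count_multiplications(2, 3, 4))   # rectangular case: n * p * q
print(count_multiplications(5, 5, 5))   # square case: n ** 3
```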


4.3 The Strassen Algorithm 


It was believed for a long time that the multiplication of matrices of order
n could not be carried out in fewer than $n^3$ operations. So the discovery of
a faster algorithm by Strassen in 1969 came as a surprise. Strassen devised
a very clever algorithm for matrix multiplication that requires many fewer
operations, on the order of

\[
N_{op}(n) = O(n^{\log_2 7}) \quad\text{with}\quad \log_2 7 \approx 2.81.
\]


It may seem pointless to seek the optimal algorithm for the multiplication of 
matrices. However, beyond the time that can be saved for large matrices (the 
Strassen algorithm has indeed been used on supercomputers), we shall see 
in the next section that the asymptotic complexity of matrix multiplication 
is equivalent to that of other operations clearly less trivial, such as matrix 
inversion. The Strassen algorithm relies on the following result. 


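Lemma 4.3.1. The product of two matrices of order 2 may be done with 7
multiplications and 18 additions (instead of 8 multiplications and 4 additions
by the usual rule).

Proof. A simple computation shows that

\[
\begin{pmatrix} a & b \\ c & d \end{pmatrix}
\begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix}
=
\begin{pmatrix}
m_1 + m_2 - m_4 + m_6 & m_4 + m_5 \\
m_6 + m_7 & m_2 - m_3 + m_5 - m_7
\end{pmatrix},
\]

with

\[
\begin{aligned}
m_1 &= (b - d)(\gamma + \delta), & m_5 &= a(\beta - \delta), \\
m_2 &= (a + d)(\alpha + \delta), & m_6 &= d(\gamma - \alpha), \\
m_3 &= (a - c)(\alpha + \beta), & m_7 &= (c + d)\alpha, \\
m_4 &= (a + b)\delta.
\end{aligned}
\]

We count indeed 7 multiplications and 18 additions.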


Remark 4.3.1. We note that the multiplication rule of Strassen in the above 
lemma is also valid if the entries of the matrices are not scalars but instead 
belong to a noncommutative ring. In particular, the rule holds for matrices 
defined by blocks. 
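
The rule of Lemma 4.3.1 is easy to check numerically. The Python sketch below
is an illustration of ours: the names m1, ..., m7 follow the usual way of
writing Strassen's seven products for the product of [[a, b], [c, d]] by
[[al, be], [ga, de]], and the function names are hypothetical. In every
product the left factor comes from the first matrix, so no commutativity is
used.

```python
def strassen_2x2(a, b, c, d, al, be, ga, de):
    """Product [[a, b], [c, d]] * [[al, be], [ga, de]] with 7 multiplications."""
    m1 = (b - d) * (ga + de)
    m2 = (a + d) * (al + de)
    m3 = (a - c) * (al + be)
    m4 = (a + b) * de
    m5 = a * (be - de)
    m6 = d * (ga - al)
    m7 = (c + d) * al
    return [[m1 + m2 - m4 + m6, m4 + m5],
            [m6 + m7, m2 - m3 + m5 - m7]]


def usual_2x2(a, b, c, d, al, be, ga, de):
    """Same product by the usual rule (8 multiplications)."""
    return [[a * al + b * ga, a * be + b * de],
            [c * al + d * ga, c * be + d * de]]


sample = (1, 2, 3, 4, 5, 6, 7, 8)
print(strassen_2x2(*sample) == usual_2x2(*sample))
```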


Consider then a matrix of size $n = 2^k$. We split this matrix into 4 blocks
of size $2^{k-1}$, and we apply Strassen’s rule. If we count not only
multiplications but additions too, the number of operations $N_{op}(n)$ to
determine the product of two matrices satisfies

\[ N_{op}(2^k) = 7 N_{op}(2^{k-1}) + 18 (2^{k-1})^2, \]

since the addition of two matrices of size n requires $n^2$ additions. A simple
induction yields

\[ N_{op}(2^k) = 7^k N_{op}(1) + 18 \sum_{i=0}^{k-1} 7^i 4^{k-1-i} \le 7^k \bigl(N_{op}(1) + 6\bigr). \]

We easily deduce that the optimal number of operations $N_{op}(n)$ satisfies,
for all n,

\[ N_{op}(n) \le C n^{\log_2 7} \quad\text{with } \log_2 7 \approx 2.81. \]
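
The recurrence and its closed form can be compared for small values of k. The
following Python sketch is ours; it assumes Nop(1) = 1 (a single
multiplication for matrices of size 1), and the function name strassen_ops is
hypothetical.

```python
def strassen_ops(k, n1=1):
    """N_op(2^k) from N_op(2^j) = 7 N_op(2^(j-1)) + 18 (2^(j-1))^2, N_op(1) = n1."""
    ops = n1
    for j in range(1, k + 1):
        ops = 7 * ops + 18 * (2 ** (j - 1)) ** 2
    return ops


for k in range(1, 11):
    closed_form = 7 ** k + 18 * sum(7 ** i * 4 ** (k - 1 - i) for i in range(k))
    bound = 7 ** k * (1 + 6)
    print(k, strassen_ops(k), closed_form, bound)
```

For k = 1 the count is 25, that is, the 7 multiplications and 18 additions of
Lemma 4.3.1.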


Since Strassen’s original idea, other increasingly complex algorithms have been
devised, whose number of operations increases more slowly for n large.
However, the best algorithm (in terms of complexity) has not yet been found. As
of today, the best algorithm such that $N_{op}(n) \le Cn^{\alpha}$ has an
exponent α close to 2.37 (it is due to Coppersmith and Winograd). It has even
been proved that if there exists an algorithm such that $P(n) \le Cn^{\alpha}$,
then there exists another algorithm such that $P(n) \le C'n^{\alpha'}$ with
α' < α. Unfortunately, these “fast” algorithms are tricky to program,
numerically less stable, and the constant C in Nop(n) is so large that no gains
can be expected before a value of, say, n = 100.


4.4 Equivalence of Operations


We have just seen that for an operation as simple as the multiplication of
matrices, there exist algorithms whose asymptotic complexities are quite
different ($n^3$ for the standard algorithm, $n^{\log_2 7}$ for Strassen’s
algorithm). On the other hand, in most cases the best algorithm possible, in
terms of complexity, is not known for mathematical operations such as matrix
multiplication. Therefore, we cannot talk about the complexity of an operation
in the sense of the complexity of its best algorithm. At most, we shall usually
determine an upper bound for the number of operations (possibly improvable).
Hence, we introduce the following definition.


Definition 4.4.1. Let Nop(n) be the (possibly unknown) complexity of the best
algorithm performing a matrix operation. We call the following bound

\[ N_{op}(n) \le C n^{\alpha} \quad \forall n \ge 0, \]

where C and α are positive constants independent of n, the asymptotic
complexity of this operation, and we denote it by $O(n^{\alpha})$.


An amazing result is that many matrix operations are equivalent in the sense
that they have the same asymptotic complexity. For instance, although, at
first glance, matrix inversion seems to be much more complex, it is equivalent
to matrix multiplication.


Theorem 4.4.1. The following operations have the same asymptotic complexity
in the sense that if there exists an algorithm executing one of them with a
complexity $O(n^{\alpha})$ where α ≥ 2, then it automatically yields an
algorithm for every other operation with the same complexity $O(n^{\alpha})$:

(i) product of two matrices: $(A, B) \mapsto AB$,
(ii) inversion of a matrix: $A \mapsto A^{-1}$,
(iii) computation of the determinant: $A \mapsto \det A$,
(iv) solving a linear system: $(A, b) \mapsto x = A^{-1}b$.


Proof. The difficulty is that Theorem 4.4.1 should be proved without knowing
the algorithms, or the exact exponent α. We prove only the equivalence
between (i) and (ii) (the other equivalences are much harder to prove). Let
I(n) be the number of operations required to compute $A^{-1}$ by a given
algorithm. We assume that there exist C and α such that $I(n) \le Cn^{\alpha}$.
Let us show that there exists an algorithm that computes the product AB whose
number of operations P(n) satisfies $P(n) \le C'n^{\alpha}$ with the same
exponent α and C' > 0. First of all, we note that

\[
\begin{pmatrix} I & A & 0 \\ 0 & I & B \\ 0 & 0 & I \end{pmatrix}^{-1}
=
\begin{pmatrix} I & -A & AB \\ 0 & I & -B \\ 0 & 0 & I \end{pmatrix}.
\]

Consequently, the product AB is obtained by inverting a matrix that is 3
times larger. Hence
\[ P(n) \le I(3n) \le C\, 3^{\alpha} n^{\alpha}. \]
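
The block identity used in this first half of the proof can be verified on a
small example. The Python sketch below is ours (all helper names are
hypothetical); it builds the two block matrices of size 3n for n = 2, checks
that their product is the identity, and reads AB in the top right block of the
inverse.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def zeros(n):
    return [[0] * n for _ in range(n)]


def neg(X):
    return [[-v for v in row] for row in X]


def assemble(blocks):
    """Assemble a matrix from a grid (list of rows) of equally sized blocks."""
    return [sum((blk[r] for blk in block_row), [])
            for block_row in blocks
            for r in range(len(block_row[0]))]


n = 2
A = [[1, 2], [3, 4]]
B = [[0, 1], [5, -2]]
I, Z = identity(n), zeros(n)

M = assemble([[I, A, Z], [Z, I, B], [Z, Z, I]])
M_inv = assemble([[I, neg(A), matmul(A, B)], [Z, I, neg(B)], [Z, Z, I]])

print(matmul(M, M_inv) == identity(3 * n))      # the claimed inverse is correct
top_right = [row[2 * n:] for row in M_inv[:n]]  # block containing AB
print(top_right)
```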

Now let P(n) be the number of operations needed to compute AB by a given
algorithm. We assume that there exist C and α such that $P(n) \le Cn^{\alpha}$.
Let us show that there exists an algorithm that computes $A^{-1}$ whose number
of operations I(n) satisfies $I(n) \le C'n^{\alpha}$ with the same exponent α
and C' > 0. In this case, we notice that

\[
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
=
\begin{pmatrix}
A^{-1} + A^{-1}B\Delta^{-1}CA^{-1} & -A^{-1}B\Delta^{-1} \\
-\Delta^{-1}CA^{-1} & \Delta^{-1}
\end{pmatrix}
\tag{4.3}
\]

with $\Delta = D - CA^{-1}B$ an invertible matrix (sometimes called the Schur
complement). Therefore, to evaluate the inverse matrix on the left-hand side
of (4.3), we can successively compute

• the inverse of A;
• the matrix $X_1 = A^{-1}B$;
• the Schur complement $\Delta = D - CX_1$;
• the inverse of Δ;
• the matrices $X_2 = X_1\Delta^{-1}$, $X_3 = CA^{-1}$, $X_4 = \Delta^{-1}X_3$,
  and $X_5 = X_2X_3$.

The left-hand side of (4.3) is then equal to

\[
\begin{pmatrix} A^{-1} + X_5 & -X_2 \\ -X_4 & \Delta^{-1} \end{pmatrix}.
\]

Since this method requires 2 inversions and 6 multiplications, we deduce that
\[ I(2n) \le 2I(n) + 6P(n), \]


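if we neglect additions (for simplicity). By iterating this formula for
$n = 2^k$, we get

\[
I(2^k) \le 2^k I(1) + 6 \sum_{i=0}^{k-1} 2^{k-i-1} P(2^i)
\le C \Bigl( 2^k + 6 \sum_{i=0}^{k-1} 2^{k-i-1+\alpha i} \Bigr).
\]

Since α ≥ 2, we infer
\[ I(2^k) \le C' 2^{\alpha k}. \]
If $n \ne 2^k$ for all k, then there exists k such that $2^k < n < 2^{k+1}$.
We inscribe the matrix A in a larger matrix of size $2^{k+1}$:

\[
\tilde{A} = \begin{pmatrix} A & 0 \\ 0 & I \end{pmatrix}
\quad\text{with}\quad
\tilde{A}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ 0 & I \end{pmatrix},
\]

where I is the identity of order $2^{k+1} - n$. Applying the previous result
to $\tilde{A}$ yields
\[ I(n) \le C' (2^{k+1})^{\alpha} \le C' 2^{\alpha} n^{\alpha}, \]

which is the desired result.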
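
The successive steps of the second half of the proof (2 inversions and 6 block
products) can be sketched as follows. This Python illustration is ours: it
works in exact rational arithmetic on small blocks, the helper names are
hypothetical, and the inversion of the blocks is done by a plain Gauss-Jordan
elimination rather than recursively. It assumes that A and the Schur
complement are nonsingular.

```python
from fractions import Fraction


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y, sign=1):
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def neg(X):
    return [[-v for v in row] for row in X]


def inverse(M):
    """Gauss-Jordan inverse in exact arithmetic (M assumed nonsingular)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[piv] = aug[piv], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def block_inverse(A, B, C, D):
    """Inverse of [[A, B], [C, D]] following the steps listed after (4.3)."""
    A_inv = inverse(A)                   # inversion 1
    X1 = matmul(A_inv, B)                # product 1
    delta = add(D, matmul(C, X1), -1)    # product 2: Schur complement D - C X1
    d_inv = inverse(delta)               # inversion 2
    X2 = matmul(X1, d_inv)               # product 3
    X3 = matmul(C, A_inv)                # product 4
    X4 = matmul(d_inv, X3)               # product 5
    X5 = matmul(X2, X3)                  # product 6
    top = [ra + rb for ra, rb in zip(add(A_inv, X5), neg(X2))]
    bottom = [rc + rd for rc, rd in zip(neg(X4), d_inv)]
    return top + bottom


A, B = [[2, 1], [1, 3]], [[1, 0], [2, 1]]
C, D = [[0, 1], [1, 1]], [[4, 1], [2, 5]]
X = [ra + rb for ra, rb in zip(A, B)] + [rc + rd for rc, rd in zip(C, D)]
print(block_inverse(A, B, C, D) == inverse(X))
```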
4.5 Exercises


Warning: for the following exercises do not use Matlab except where explicitly 
requested. 


4.1 (*). Let u and v be two vectors of $\mathbb{R}^n$, and let A and B be two
square matrices of $\mathcal{M}_n(\mathbb{R})$.

1. Find the numbers of operations required to compute the scalar product
   $\langle u, v \rangle$, the Euclidean norm $\|u\|_2$, and the rank-one
   matrix $uv^t$.
2. Find the numbers of operations required to compute the matrix-vector
   product Au, and the matrix product AB.
3. For n = 100k, with k = 1,...,5, estimate the running time of Matlab
   (use the functions tic and toc) for computing the product of two matrices
   A=rand(n,n) and B=rand(n,n). Plot this running time in terms of n.
4. Assume that this running time is a polynomial function of n, so that for
   n large enough, $T(n) \approx Cn^s$. In order to find a numerical
   approximation of the exponent s, plot the logarithm of T in terms of the
   logarithm of n. Deduce an approximate value of s.


4.2 (*). In order to compute the product C = AB of two real square matrices
A and B, we use the usual algorithm

\[ c_{i,j} = \sum_{k=1}^{n} a_{i,k} b_{k,j}, \quad 1 \le i, j \le n, \]

with the notation $A = (a_{i,j})_{1 \le i,j \le n}$,
$B = (b_{i,j})_{1 \le i,j \le n}$, and $C = (c_{i,j})_{1 \le i,j \le n}$.

1. Prove that if A is lower triangular, then the computational complexity for
   the product C = AB is equivalent to $n^3/2$ for n large (recall that only
   multiplications and divisions are counted).
2. Write, in pseudolanguage, an algorithm that makes it possible to compute
   the product C = AB of a lower triangular matrix A with any matrix B
   that has the computational complexity $n^3/2$.
3. We assume henceforth that both matrices A and B are lower triangular.
   Taking into account their special structure, prove that the computational
   complexity for the product C = AB is equivalent to $n^3/6$.
4. Write a function LowTriMatMult that performs the product of two lower
   triangular matrices, exploiting the sparse structure of these matrices.
   Compare the results obtained with those of Matlab.
5. Write a function MatMult that executes the product of two matrices
   (without any special structure). Compare the computational time of this
   function with that of LowTriMatMult for computing the product of two lower
   triangular matrices.
6. Fix n = 300. Define a=triu(rand(n,n)) and b=triu(rand(n,n)). Find
   the running time $t_1$ for computing the product a*b. In order to exploit
   the sparse structure of the matrices, we define sa=sparse(a),
   sb=sparse(b). Find the running time $t_2$ for the command sa*sb. Compare
   $t_1$ and $t_2$.


4.3. Let A and B be square matrices of $\mathcal{M}_n(\mathbb{R})$, and u a
vector of $\mathbb{R}^n$.


1. If A is a band matrix (see Definition 6.2.1), compute the computational
complexity for computing Au (assuming n large) in terms of the
half-bandwidth p and of n.

2. If A and B are two band matrices, of equal half-bandwidths p, prove that
the product AB is a band matrix. Find the computational complexity for
computing AB.


4.4. Write a function Strassen that computes the product of two matrices
of size n × n, with $n = 2^k$, by the Strassen algorithm. It is advised to use the
recursiveness in Matlab, that is, the possibility of calling a function within 
its own definition. Check the algorithm by comparing its results with those 
provided by Matlab. Compare with the results obtained with the function 
MatMult. 


4.5. Let A, B, C, and D be matrices of size n × n. Define a matrix X of size
2n × 2n by

\[ X = \begin{pmatrix} A & B \\ C & D \end{pmatrix}. \]

We assume that A is nonsingular as well as the matrix $\Delta = D - CA^{-1}B$
(Δ is called the Schur complement of X). Under these two assumptions, check
that the matrix X is nonsingular, and that its inverse is

\[
X^{-1} =
\begin{pmatrix}
A^{-1} + A^{-1}B\Delta^{-1}CA^{-1} & -A^{-1}B\Delta^{-1} \\
-\Delta^{-1}CA^{-1} & \Delta^{-1}
\end{pmatrix}.
\tag{4.4}
\]

Compute the inverse of a 2n × 2n matrix by implementing (4.4) in Matlab.
Use the command inv to compute the inverses of the blocks of size n × n, then
try to minimize the number of block products. Compare your results with the
standard command inv(X).


5 


Linear Systems 


We call the problem that consists in finding the (possibly multiple) solution
$x \in \mathbb{K}^p$, if any, of the following algebraic equation

\[ Ax = b \tag{5.1} \]

a linear system. The matrix $A \in \mathcal{M}_{n,p}(\mathbb{K})$, called the
“system matrix,” and the vector $b \in \mathbb{K}^n$, called the “right-hand
side,” are the data of the problem; the vector $x \in \mathbb{K}^p$ is the
unknown. As usual, $\mathbb{K}$ denotes the field $\mathbb{R}$ or $\mathbb{C}$.
The matrix A has n rows and p columns: n is the number of equations (the 
dimension of b) and p is the number of unknowns (the dimension of x). 

In this chapter, we study existence and uniqueness of solutions for the lin- 
ear system (5.1) and we discuss some issues concerning stability and precision 
for any practical method to be used on a computer for solving it. 


5.1 Square Linear Systems 


In this section, we consider only linear systems with the same number of 
equations and unknowns: n = p. Such a linear system is said to be square 
(like the matrix A). This particular case, n = p, is very important, because 
it is the most frequent in numerical practice. Furthermore, the invertibility of 
A provides an easy criterion for the existence and uniqueness of the solution. 


Note that it is only in the case n = p that the inverse of a matrix can be 
defined. 


Theorem 5.1.1. If the matrix A is nonsingular, then there exists a unique 
solution of the linear system Ax = b. If A is singular, then one of the following 
alternatives holds: either the right-hand side b belongs to the range of A and 
there exists an infinity of solutions that differ one from the other by addition 
of an element of the kernel of A, or the right-hand side b does not belong to 
the range of A and there are no solutions.


The proof of this theorem is obvious (see [10] if necessary), but it does not
supply a formula to compute the solution when it exists. The next proposition
gives such a formula, the so-called Cramer formulas.


Proposition 5.1.1 (Cramer formulas). Let $A = (a_1|\ldots|a_n)$ be a
nonsingular matrix with columns $a_i \in \mathbb{R}^n$. The solution of the
linear system Ax = b is given by its entries:

\[
x_i = \frac{\det(a_1|\ldots|a_{i-1}|b|a_{i+1}|\ldots|a_n)}{\det A},
\quad 1 \le i \le n.
\]


Proof. Since the determinant is an alternating multilinear form, we have for
all $j \ne i$,

\[ \det(a_1|\ldots|a_{i-1}|\lambda a_i + \mu a_j|a_{i+1}|\ldots|a_n) = \lambda \det A \]

for all λ and μ in $\mathbb{K}$. The equality Ax = b is equivalent to
$b = \sum_{i=1}^{n} x_i a_i$, that is, the $x_i$ are the entries of b in the
basis formed by the columns $a_i$ of the matrix A. We deduce that

\[ \det(a_1|\ldots|a_{i-1}|b|a_{i+1}|\ldots|a_n) = x_i \det A, \]

which is the aforementioned formula.


Let us say right away that the Cramer formulas are not of much help
in computing the solution of a linear system. Indeed, they are very expensive
in terms of CPU time on a computer. To give an idea of the prohibitive
cost of the Cramer formulas, we give a lower bound $c_n$ for the number of
multiplications required to compute the determinant (by the classical row
(or column) expansion method) of a square matrix of order n. We have
$c_n = n(1 + c_{n-1}) \ge n c_{n-1}$ and thus
$c_n \ge n! = n(n-1)(n-2)\cdots 1$. Therefore, more than (n + 1)!
multiplications are needed to compute the solution of problem (5.1) by the
Cramer method, which is huge. For n = 50, and if the computations are carried
out on a computer working at 1 gigaflop (i.e., one billion operations per
second), the determination of the solution of (5.1) by the Cramer method
requires at least

\[
\frac{51!}{(365 \cdot 24 \cdot 60 \cdot 60) \cdot 10^9} \approx 4.9 \times 10^{49} \text{ years!!!}
\]
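
This order of magnitude is easy to evaluate directly. The short Python sketch
below is ours; it uses the lower bound (n + 1)! for the Cramer method and
compares it with a cost of $n^3$ operations on the same hypothetical 1 gigaflop
machine.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
FLOPS = 10 ** 9                       # 1 gigaflop, as assumed in the text

n = 50
cramer_mults = math.factorial(n + 1)  # lower bound (n + 1)! on multiplications
years = cramer_mults / (SECONDS_PER_YEAR * FLOPS)
seconds_cubic = n ** 3 / FLOPS        # time for n^3 operations

print(f"Cramer: {years:.2e} years; n^3 operations: {seconds_cubic:.2e} s")
```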


Even if we use a clever way of computing determinants, requiring say the
order of $n^{\alpha}$ operations, the Cramer formulas would yield a total cost
of order $n^{\alpha+1}$, which is still prohibitive, since Theorem 4.4.1 claims
that computing a determinant or solving a linear system should have the same
asymptotic complexity. We shall study in Chapter 6 methods that require a
number of operations on the order of $n^3$, which is much less than n!; the
same computer would take $10^{-4}$ seconds to execute the $n^3$ operations for
n = 50. Let us check that Matlab actually solves linear systems of size n in
$O(n^3)$ operations. In Figure 5.1 we display the computational time required
by the command A\b, where the entries of A, an n × n matrix, are chosen
randomly. The results are displayed with a log-log scale (i.e., ln(time) in
terms of ln(n)) for 10 values of n in the range (100, 1000). If the dependence
is of the form
\[ \mathrm{time}(n) = a n^p + O(n^{p-1}), \quad a > 0, \]
taking the logarithms yields
\[ \ln(\mathrm{time}(n)) = p \ln(n) + \ln(a) + O(1/n). \]


The plotted curve is indeed close to a line of slope 3; hence the approximation
p ≈ 3. In practice, the slope p is larger than 3 (especially for large values
of n) because in addition to the execution time of operations, one should take
into account the time needed to access the data in the computer memory.
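
The slope p can be estimated by a least squares fit of ln(time) against ln(n).
The Python sketch below is ours and uses synthetic times generated from an
assumed model $a n^3 + c n^2$ (they are not measurements); the function name
loglog_slope and the constants are hypothetical.

```python
import math


def loglog_slope(ns, times):
    """Least squares slope of ln(time) as a function of ln(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in times]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


ns = list(range(100, 1001, 100))                      # 10 values of n
times = [2e-10 * n ** 3 + 5e-10 * n ** 2 for n in ns]  # synthetic model only
p = loglog_slope(ns, times)
print(f"estimated slope p = {p:.3f}")
```

With measured running times in place of the synthetic ones, the same fit gives
the experimental exponent.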


[Figure 5.1: log-log plot of log(CPU) versus log(n), together with a straight
line of slope 3.]
Fig. 5.1. Running time for solving Ax = b (command A\b of Matlab) in terms of n.


To conclude, the Cramer formulas are never used, because they are too
expensive in terms of computational time. This is one example, among many
others, of an elegant theoretical concept that is not necessarily practical
and efficient from a numerical viewpoint.

The next chapters are devoted to various methods for solving linear sys- 
tems like (5.1). Before studying the general case, corresponding to any non- 
singular matrix, let us review some simple particular cases: 


• If A is a diagonal matrix, it is clear that the computation of the solution x
  of (5.1) is performed in just n operations. Recall that we take into account
  only multiplications and divisions and not additions and subtractions.

• If A is a unitary matrix, the solution is given by $x = A^{-1}b = A^*b$. Since
  transposition and complex conjugation require no operations, such a
  computation boils down to a matrix multiplied by a vector, which is performed
  in $n^2$ operations.

• If A is a triangular matrix (lower for instance), then (5.1) reads

\[
\begin{pmatrix}
a_{1,1} & 0 & \cdots & \cdots & 0 \\
a_{2,1} & a_{2,2} & \ddots & & \vdots \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1} & 0 \\
a_{n,1} & a_{n,2} & \cdots & a_{n,n-1} & a_{n,n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n-1} \\ b_n \end{pmatrix}.
\]

  The solution can be computed by the so-called forward substitution
  algorithm (Algorithm 5.1). We first compute $x_1 = b_1/a_{1,1}$; then, using
  this value of $x_1$, we obtain $x_2 = (b_2 - a_{2,1}x_1)/a_{2,2}$; and so on
  up to $x_n$. For an upper triangular matrix, we use a “back substitution
  algorithm,” i.e., we compute the entries of x in reverse order, starting
  from $x_n$ down to $x_1$.


Data: A, b. Output: x = A^{-1}b.


For i = 1 ↗ n
    s = 0
    For j = 1 ↗ i − 1
        s = s + A_{i,j} x_j
    End j
    x_i = (b_i − s)/A_{i,i}
End i


Algorithm 5.1: Forward substitution algorithm.


To compute the n entries of x, we thus perform

• 1 + 2 + ··· + (n − 1) = n(n − 1)/2 multiplications,

• n divisions.

This is a total number of order $n^2/2$ operations. Note that this algorithm
gives the solution x without having to compute the inverse matrix $A^{-1}$:
this will always be the case for any efficient algorithm for solving linear
systems.
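
As an illustration, Algorithm 5.1 can be transcribed almost line by line. The
Python sketch below is ours (0-based indices, hypothetical function name); it
also counts the multiplications and divisions, so that the total
n(n − 1)/2 + n can be observed on a small example. It assumes that the diagonal
entries of A are nonzero.

```python
def forward_substitution(A, b):
    """Solve A x = b for a lower triangular matrix A with nonzero diagonal.

    Returns x and the number of multiplications and divisions performed.
    """
    n = len(b)
    x = [0.0] * n
    ops = 0
    for i in range(n):
        s = 0.0
        for j in range(i):        # j = 1, ..., i - 1 in the 1-based notation
            s += A[i][j] * x[j]
            ops += 1              # one multiplication
        x[i] = (b[i] - s) / A[i][i]
        ops += 1                  # one division
    return x, ops


A = [[2.0, 0.0, 0.0],
     [1.0, 4.0, 0.0],
     [3.0, -1.0, 5.0]]
b = [2.0, 9.0, 11.0]
x, ops = forward_substitution(A, b)
print(x, ops)
```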


Since it is very easy to solve a linear system whose matrix is triangular,
many solution methods consist in reducing the problem to solving a triangular
system. Examples of such methods are given in the next chapters. We shall
remember at this stage that solving a triangular linear system requires (on
the order of) $n^2/2$ operations.


Remark 5.1.1. To solve a linear system, it is not necessary to compute
$A^{-1}$. We have just seen this with the Cramer formulas and the case of a
triangular matrix. In general, solving a linear system does not require that
one compute the inverse matrix $A^{-1}$ because it is too expensive.

Conversely, if we have at our disposal an algorithm to solve the linear
system Ax = b, then we easily infer a method for determining the inverse
matrix $A^{-1}$. Indeed, if we denote by $(e_i)_{1 \le i \le n}$ the canonical
basis of $\mathbb{K}^n$ and $x_i$ the solution of the system with $e_i$ as
right-hand side (i.e., $Ax_i = e_i$), then the matrix with columns
$(x_i)_{1 \le i \le n}$ is nothing but $A^{-1}$.


Remark 5.1.2. When solving a linear system, we use neither the diagonalization
nor the triangularization properties of the matrix A. Obviously, if we
explicitly knew such a factorization $A = PTP^{-1}$, solving the linear system
would be easy. However, the trouble is that computing the diagonal or
triangular form of a matrix comes down to determining its eigenvalues, which is
much harder and much more costly than solving a linear system by classical
methods.


Remark 5.1.3. If one wants to minimize the storage size of a triangular (or
symmetric) matrix, it is not a good idea to represent such matrices
$A \in \mathcal{M}_n(\mathbb{R})$ by a square array of dimension n × n.
Actually, half the memory space would suffice to store the n(n + 1)/2 nonzero
entries of A. It is thus enough to declare a vector array STOREA of dimension
n(n + 1)/2 and to manage the correspondence between indices (i, j) and index k
such that A(i, j) = STOREA(k). We easily check that k(i, j) = j + i(i − 1)/2
if the lower triangular matrix A (entries with j ≤ i) is stored row by row;
see Exercise 5.3.
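
A minimal Python sketch of such a packed storage is given below. It is an
illustration of ours, written for a lower triangular matrix stored row by row
(entries with j ≤ i), in which case the 1-based index k = j + i(i − 1)/2
enumerates the stored entries without gaps; the name packed_index is
hypothetical.

```python
def packed_index(i, j):
    """1-based position in STOREA of the entry (i, j), j <= i, of a lower
    triangular matrix stored row by row: k = j + i (i - 1) / 2."""
    return j + i * (i - 1) // 2


n = 4
# Lower triangular test matrix whose entry (i, j) is 10 i + j (1-based).
A = [[10 * i + j if j <= i else 0 for j in range(1, n + 1)]
     for i in range(1, n + 1)]

# Packed storage: only the n (n + 1) / 2 entries with j <= i are kept.
storea = [A[i - 1][j - 1] for i in range(1, n + 1) for j in range(1, i + 1)]

print(len(storea), storea[packed_index(3, 2) - 1])  # entry (3, 2) of A
```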


5.2 Over- and Underdetermined Linear Systems 


In this section we consider linear systems of the form (5.1) with different
numbers of equations and unknowns, n ≠ p. When n < p, we say that the
system is underdetermined: there are more unknowns than equations (which
allows more “freedom” for the existence of solutions). When n > p, we say
that the system is overdetermined: there are fewer unknowns than equations
(which restricts the possibility of existence of solutions). In both cases, let us
recall a very simple but fundamental result [10].


Theorem 5.2.1. There exists at least one solution of linear system (5.1) if 
and only if the right-hand side b belongs to the range of A. The solution is 
unique if and only if the kernel of A is reduced to the zero vector. Two solutions 
differ by an element of the kernel of A. 


When n ≠ p, there is no simpler criterion for the existence of solutions to a
linear system. We can only indicate, in a heuristic fashion, that it is more
likely for there to be solutions to an underdetermined system than to an
overdetermined one. Let us recall in any case the following obvious
consequence of the rank theorem.


Lemma 5.2.1. If n < p, then dim Ker A ≥ p − n ≥ 1, and if there exists a
solution of the linear system (5.1), there exists an infinity of them.

To avoid this delicate issue of the existence of a solution to a nonsquare
linear system, there is another way of looking at it, namely as a “least
squares problem”: in other words, a “generalized” or approximate solution, in
the sense of least squares fitting, is sought. Chapter 7 is dedicated to this
topic.


5.3 Numerical Solution 


We now focus on the practical aspects for solving numerically a linear system 
on a computer. We have already seen that solution algorithms have to be very 
efficient, that is, fast (by minimizing the number of performed operations), 
and sparing memory storage. Furthermore, there is another practical requisite 
for numerical algorithms: their accuracy. Indeed, in scientific computing there 
are no exact computations! As we shall see below in Section 5.3.1, a computer’s 
accuracy is limited due to the number of bits used to represent real numbers: 
usually 32 or 64 bits (which makes about 8 or 16 significant digits). Therefore, 
utmost attention has to be paid to the inevitable rounding errors and to their 
propagation during the course of a computation. The example below is a 
particularly striking illustration of this issue. 


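Example 5.3.1. Consider the following linear system:

\[
\begin{pmatrix} 8 & 6 & 4 & 1 \\ 1 & 4 & 5 & 1 \\ 8 & 4 & 1 & 1 \\ 1 & 4 & 3 & 6 \end{pmatrix}
x =
\begin{pmatrix} 19 \\ 11 \\ 14 \\ 14 \end{pmatrix}
\;\Rightarrow\;
x = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}.
\]

A slight perturbation of the right-hand side gives a very different solution:

\[
\begin{pmatrix} 8 & 6 & 4 & 1 \\ 1 & 4 & 5 & 1 \\ 8 & 4 & 1 & 1 \\ 1 & 4 & 3 & 6 \end{pmatrix}
x =
\begin{pmatrix} 19.01 \\ 11.05 \\ 14.07 \\ 14.05 \end{pmatrix}
\;\Rightarrow\;
x = \begin{pmatrix} -2.34 \\ 9.745 \\ -4.85 \\ -1.34 \end{pmatrix}.
\]

This example shows that small errors in the data or in intermediate results
may lead to unacceptable errors in the solution. Actually, the relative error
in the solution, computed in the $\|\cdot\|_\infty$ norm, is about 2373 times larger than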
the relative error on the right-hand side of the equation. This amplification of 
errors depends on the considered matrix (for instance for the identity matrix 
there are no amplifications of errors). One has therefore to make sure that 
numerical algorithms do not favor such an amplification. Such a property is 
called stability. 
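
The figures of Example 5.3.1 can be reproduced in exact rational arithmetic,
so that no rounding error interferes with the comparison. The Python sketch
below is ours; the function names are hypothetical, and the solver is a plain
Gaussian elimination used only for this small check.

```python
from fractions import Fraction


def solve(A, b):
    """Gaussian elimination in exact rational arithmetic (A nonsingular)."""
    n = len(b)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    x = [Fraction(0)] * n
    for i in reversed(range(n)):   # back substitution on the triangular system
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def norm_inf(v):
    return max(abs(t) for t in v)


A = [[8, 6, 4, 1], [1, 4, 5, 1], [8, 4, 1, 1], [1, 4, 3, 6]]
b = [19, 11, 14, 14]
b_pert = [Fraction(s) for s in ("19.01", "11.05", "14.07", "14.05")]

x, x_pert = solve(A, b), solve(A, b_pert)
dx = [u - v for u, v in zip(x_pert, x)]
db = [u - v for u, v in zip(b_pert, b)]
amplification = (norm_inf(dx) / norm_inf(x)) / (norm_inf(db) / norm_inf(b))

print([float(t) for t in x_pert], float(amplification))
```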


Remark 5.3.1. Numerical methods (or algorithms) for solving linear systems 
have to be at the same time efficient and stable. This is really a crucial issue, 
especially for so-called iterative methods (see Chapter 8). Their name is in
opposition to direct methods (see Chapter 6), which would compute the exact
solution if there were no rounding errors (perfect or exact arithmetic). On the 
other hand, iterative methods compute a sequence of approximate solutions 
that converges to the exact solution: in such a case, stability is a necessary 
condition. 


5.3.1 Floating-Point System 


We briefly discuss the representation of numbers in digital computers and its 
associated arithmetic, which is not exact and is of limited accuracy, as we 
anticipated. Since digital computers have a finite memory, integers and real 
numbers can be (approximately) represented by only a finite number of bits. 
Let us describe a first naive representation system. 


Fixed-Point Representation. Suppose that p bits are available to code
an integer. Here is a simple way to do it. The first bit is used to indicate
the sign of the integer (0 for a positive integer and 1 for a negative one),
the p − 1 other bits contain the base-2 representation of the integer. For
example, for p = 8, the positive integers
$11 = 1 \times 2^3 + 1 \times 2^1 + 1 \times 2^0$ and
$43 = 1 \times 2^5 + 1 \times 2^3 + 1 \times 2^1 + 1 \times 2^0$ are encoded as


    0|0|0|0|1|0|1|1   and   0|0|1|0|1|0|1|1


For negative integers, the complement representation can be used: it consists
in reversing the bits (0 becomes 1 and 1 becomes 0). For example, −11 and
−43 are encoded as


    1|1|1|1|0|1|0|0   and   1|1|0|1|0|1|0|0
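
The encoding just described can be sketched in a few lines of Python. This
illustration is ours (hypothetical function name); it follows the convention
of the text, namely a sign bit followed by p − 1 magnitude bits, with all bits
reversed for a negative integer.

```python
def encode_fixed_point(value, p=8):
    """Encode an integer on p bits: sign bit, then p - 1 magnitude bits.

    A negative integer is obtained by reversing every bit of the encoding
    of its absolute value (complement representation of the text).
    """
    magnitude = abs(value)
    if magnitude >= 2 ** (p - 1):
        raise OverflowError("integer does not fit on p bits")
    bits = format(magnitude, f"0{p}b")
    if value < 0:
        bits = "".join("1" if c == "0" else "0" for c in bits)
    return bits


for v in (11, 43, -11, -43):
    print(v, encode_fixed_point(v))
```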


Integers outside the interval $[-2^{p-1}, 2^{p-1} - 1]$ cannot be encoded in
the fixed-point representation, which is a severe limitation! The same
difficulty arises
for real numbers too. Therefore, there is a need for another representation. 
We now describe a more elaborate representation system, which is used by all 
computers dedicated to scientific computing. 


Floating-Point Representation. For given integers b, p, $n_{\min}$, and
$n_{\max}$, we define the floating-point numbers as real numbers of the form

\[ a = \pm (0.d_1 \ldots d_p) \times b^n, \]

with $d_1 \ne 0$, $0 \le d_i \le b - 1$, and $-n_{\min} \le n \le n_{\max}$. We
denote by F the (finite) set of all floating-point numbers. In this notation,


1. b is the base. The most common bases are b = 2 (binary base), b = 10
(decimal base), and b = 16 (hexadecimal base).

2. $n \in [-n_{\min}, n_{\max}]$ is the exponent that defines the order of
magnitude of the numbers to be encoded.

3. The integers $d_i \in [0, b - 1]$ are called the digits and p is the number
of significant digits. The mantissa or significand is the integer
$m = d_1 \ldots d_p$. Note that

\[ a = \pm (0.d_1 \ldots d_p) \times b^n = \pm\, b^n \sum_{k=1}^{p} d_k b^{-k}. \]

The following bounds hold for floating-point numbers

\[ a_{\min} \le |a| \le a_{\max}, \quad \forall a \in F, \]

where $a_{\min} = b^{-(n_{\min}+1)}$ corresponds to the case $d_1 = 1$,
$d_2 = \cdots = d_p = 0$, and $n = -n_{\min}$, and
$a_{\max} = b^{n_{\max}}(1 - b^{-p})$ corresponds to the case $n = n_{\max}$ and
$d_1 = \cdots = d_p = b - 1$. In other words, $a_{\min}$ is the smallest
positive real number, and $a_{\max}$ the largest one, that can be represented
in the set of floating-point numbers. Smaller numbers produce an underflow and
larger ones an overflow. Computers usually support single precision (i.e.,
representation with 32 bits) and double precision (i.e., representation with
64 bits). In the single-precision representation, 1 bit is used to code the
sign, 8 bits for the exponent, and 23 bits for the mantissa (for a total of 32
bits). In the double-precision representation, 1 bit is used to code the sign,
11 bits for the exponent, and 52 bits for the mantissa (for a total of 64
bits). The precise encoding is system dependent.
For example, in a single-precision representation, 356.728 is encoded


    0 | 0 0 0 0 0 0 0 3 | 3 5 6 7 2 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


in the decimal base and
$81.625 = 2^6 + 2^4 + 2^0 + 2^{-1} + 2^{-3} = (2^{-1} + 2^{-3} + 2^{-7} + 2^{-8} + 2^{-10})\,2^7$
is encoded


    0 | 0 0 0 0 0 1 1 1 | 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0


in the binary base. In practice, the floating-point representation is more
elaborate than these simple examples. For example, in the binary base, the
first digit of the mantissa is always 1, hence it is not necessary to store it.
By the same token, the exponent is encoded as an unsigned number by adding to
it a fixed “bias” (127 is the usual bias in single precision). Let us consider
again the real number 81.625 written this time as
$81.625 = (2^0 + 2^{-2} + 2^{-6} + 2^{-7} + 2^{-9})\,2^6$.
The biased exponent to be stored is $127 + 6 = 2^7 + 2^2 + 2^0$, and the
complete encoding is (compare with the previous encoding)


    0 | 1 0 0 0 0 1 0 1 | 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
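
On a machine using the IEEE 754 single-precision format, this last encoding
can be observed directly. The Python sketch below is ours (hypothetical
function name); it relies on the standard struct module to obtain the 32 bits
of a number and splits them into sign, biased exponent, and stored mantissa.

```python
import struct


def float32_bits(x):
    """Sign, biased exponent, and stored mantissa bits of x in IEEE 754
    single precision (big-endian bit order)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]


sign, exponent, mantissa = float32_bits(81.625)
print(sign, exponent, mantissa)
print(int(exponent, 2) - 127)   # unbiased exponent
```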


The mapping from real numbers to floating-point numbers is called the
floating-point representation or the rounding. Let fl(x) be the floating-point
number associated to the real number x. The following equality holds for all
real numbers $x \in [a_{\min}, a_{\max}] \cup \{0\}$:

\[ fl(x) = x(1 + \varepsilon), \]

with $|\varepsilon| \le \varepsilon_{machine}$, and
$\varepsilon_{machine} = \frac{1}{2} b^{1-p}$, a number called the machine
precision. Typically, for a computer with binary 32-bit single precision
(b = 2, p = 23), the machine precision is
$\varepsilon_{machine} = 2^{-23} \approx 11.9 \times 10^{-8}$, while for
a binary 64-bit double precision computer
$\varepsilon_{machine} = 2^{-52} \approx 2.2 \times 10^{-16}$
(the exponents in $\varepsilon_{machine}$ explain the 8 or 16 significant
digits for single or double-precision arithmetic). Concerning other properties
of the floating-point numbers (for example their distribution and the effect
of rounding), we refer the reader to the sections in [6], [13], and [15]
devoted to the floating-point system.
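
The double-precision value can be recovered experimentally. The Python sketch
below is ours (hypothetical function name); it assumes that Python floats are
IEEE 754 double-precision numbers, which is the case on usual platforms, and
halves a candidate until adding half of it to 1 no longer changes the result.

```python
import sys


def machine_epsilon():
    """Smallest power of 2, eps, such that 1 + eps/2 is rounded to 1."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps


eps = machine_epsilon()
print(eps, eps == 2.0 ** -52, eps == sys.float_info.epsilon)
```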

Floating-Point Arithmetic. A key issue is to quantify the precision of the
computer realization of an elementary arithmetic operation. Let us consider
the case of the operation +. The same holds true for the other operations −,
×, and ÷. Of course, the sum of two floating-point numbers is usually not
a floating-point number. We denote by ⊕ the computer realization of the
addition: for real numbers x and y,

\[ x \oplus y = fl(fl(x) + fl(y)). \]

Unlike the operation +, this operation is not associative: in general,
$(x \oplus y) \oplus z \ne x \oplus (y \oplus z)$. Overflow occurs if the
addition produces a too-large number, $|x \oplus y| > a_{\max}$, and underflow
occurs if it produces a too-small number, $|x \oplus y| < a_{\min}$. Most
computer implementations of addition (including the widely used IEEE
arithmetic) satisfy the property that the relative error is less than the
machine precision:

\[ \frac{|(x \oplus y) - (x + y)|}{|x + y|} \le \varepsilon_{machine}, \]

assuming $x + y \ne 0$ and $a_{\min} \le |x|, |y| \le a_{\max}$. Hence the
relative error on one single operation is very small, but this is not always
the case when a sequence of many operations is performed.
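
The lack of associativity is easy to observe. The Python sketch below is ours
and assumes IEEE 754 double-precision floats; it shows two sums whose value
depends on the order of the operations, the second one because a small term is
absorbed by a large one.

```python
# Same three numbers, two different groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)

# Absorption: 1.0 is lost when it is first added to a much larger number.
big, small = 1.0e16, 1.0
absorbed = (big + small) - big
kept = (big - big) + small
print(absorbed, kept)
```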

We will not study the precise roundoff, or error propagation, of vectorial 
operations (such as scalar product and matrix-vector product) or numerical 
algorithms presented in this book, and we refer the reader to [7] for more 
details. 

Let us conclude this section by saying that underflow and overflow are
not the only warning or error messages produced by a floating-point
representation. Forbidden operations (like dividing by zero) or unresolved
operations (like 0/0) produce, as an output, NaN (which means not a number) or
Inf (infinity). In practice, obtaining a NaN or Inf is a clear indication that
something is going wrong in the algorithm!


5.3.2 Matrix Conditioning 


To quantify the rounding error phenomenon, we introduce the notion of matrix
conditioning. It helps to measure the sensitivity of the solution x of the
linear system Ax = b to perturbations of the data A and b (we assume that A is
a nonsingular square matrix and that b is nonzero). Let ε > 0 be a small
parameter of data perturbation. We define $A_\varepsilon$ and $b_\varepsilon$,
perturbations of A and b, by

\[
A_\varepsilon = A + \varepsilon B, \; B \in \mathcal{M}_n(\mathbb{R}), \qquad
b_\varepsilon = b + \varepsilon \gamma, \; \gamma \in \mathbb{K}^n.
\tag{5.2}
\]

Since A is nonsingular, $A_\varepsilon$ is also nonsingular for ε small enough
(see Remark 3.3.2), and we denote by $x_\varepsilon$ the solution of the system

\[ A_\varepsilon x_\varepsilon = b_\varepsilon. \tag{5.3} \]

We remark that $A_\varepsilon^{-1} = (I + \varepsilon A^{-1}B)^{-1}A^{-1}$, and
using Proposition 3.3.1 for ε small, we have
$(I + \varepsilon A^{-1}B)^{-1} = I - \varepsilon A^{-1}B + O(\varepsilon^2)$.
Consequently, we can write an asymptotic expansion of $x_\varepsilon$ in terms
of ε:

\[
\begin{aligned}
x_\varepsilon &= (I + \varepsilon A^{-1}B)^{-1} A^{-1} (b + \varepsilon \gamma) \\
&= \bigl[ I - \varepsilon A^{-1}B + O(\varepsilon^2) \bigr] (x + \varepsilon A^{-1}\gamma) \\
&= x + \varepsilon A^{-1}(\gamma - Bx) + O(\varepsilon^2),
\end{aligned}
\]

where $O(\varepsilon^2)$ denotes a vector $y \in \mathbb{K}^n$ such that
$\|y\| = O(\varepsilon^2)$ in a given vector norm. Noting that
$\|b\| \le \|A\|\,\|x\|$, we have the following upper bounds for a vector norm
and its corresponding matrix norm:

\[
\|x_\varepsilon - x\| \le \varepsilon \|x\|\, \|A^{-1}\|\, \|A\|
\left( \frac{\|\gamma\|}{\|b\|} + \frac{\|B\|}{\|A\|} \right) + O(\varepsilon^2).
\tag{5.4}
\]

Definition 5.3.1. The condition number of a matrix
$A \in \mathcal{M}_n(\mathbb{K})$, relative to a subordinate matrix norm
$\|\cdot\|$, is the quantity defined by

\[ \mathrm{cond}(A) = \|A\|\, \|A^{-1}\|. \]

Note that we always have cond(A) ≥ 1, since
$1 = \|I\| = \|AA^{-1}\| \le \|A\|\, \|A^{-1}\|$. Inequality (5.4) reads then as

\[
\frac{\|x_\varepsilon - x\|}{\|x\|} \le \mathrm{cond}(A)
\left( \frac{\|A_\varepsilon - A\|}{\|A\|} + \frac{\|b_\varepsilon - b\|}{\|b\|} \right)
+ O(\varepsilon^2).
\tag{5.5}
\]


This upper bound shows that the relative error (to first-order in €) in x is 
bounded from above by cond(A) times the relative error in A and b. The 
condition number cond(A) thus measures the conditioning or the sensitivity 
of the problem Ax = b to perturbations in the data A or b. Even if the 
relative error in data A and b is small, the relative error in the solution x 
may be large if the quantity cond(A) is large. In other words, the condition 
number measures the amplification of errors in the data (right-hand side b or 
matrix A). We can establish a more accurate upper bound than (5.5) if we 
perturb only one datum, b or A. 
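To make the amplification concrete, here is a small sketch in plain Python that perturbs only the right-hand side of a nearly singular 2×2 system and compares the observed amplification with the condition number in the infinity norm. The matrix, the right-hand side, and the helper names (`inverse2`, `matvec`, `norm_inf_vec`, `norm_inf_mat`) are illustrative assumptions, not taken from the text.

```python
# Amplification of a right-hand-side perturbation versus cond(A) in the infinity norm.
# The matrix A and the vectors b, db are illustrative choices (assumptions).

def inverse2(M):
    """Inverse of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    """Product of a 2x2 matrix with a vector."""
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

def norm_inf_vec(v):
    """Vector infinity norm."""
    return max(abs(t) for t in v)

def norm_inf_mat(M):
    """Matrix norm subordinate to the infinity norm: maximum absolute row sum."""
    return max(sum(abs(t) for t in row) for row in M)

A = [[1.0, 1.0], [1.0, 1.0001]]
A_inv = inverse2(A)
cond_inf = norm_inf_mat(A) * norm_inf_mat(A_inv)

b = [2.0, 2.0001]     # the exact solution is x = (1, 1)
db = [0.0, 0.0001]    # small perturbation of the right-hand side
x = matvec(A_inv, b)
dx = matvec(A_inv, db)  # A dx = db

rel_data = norm_inf_vec(db) / norm_inf_vec(b)
rel_sol = norm_inf_vec(dx) / norm_inf_vec(x)
print("cond_inf(A)   =", cond_inf)
print("amplification =", rel_sol / rel_data)
```

A relative perturbation of about 5e-5 in the data changes the solution by 100 percent here; the amplification factor stays below the condition number, as the bound requires.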


Proposition 5.3.1. Let $A$ be a nonsingular matrix and $b \ne 0$ a vector. 

1. If $x$ and $x + \delta x$ are respectively the solutions of the systems 

$$Ax = b \quad\text{and}\quad A(x + \delta x) = b + \delta b,$$

we have 

$$\frac{\|\delta x\|}{\|x\|} \le \operatorname{cond}(A)\,\frac{\|\delta b\|}{\|b\|}. \tag{5.6}$$

2. If $x$ and $x + \delta x$ are respectively the solutions of the systems 

$$Ax = b \quad\text{and}\quad (A + \delta A)(x + \delta x) = b,$$

we have 

$$\frac{\|\delta x\|}{\|x + \delta x\|} \le \operatorname{cond}(A)\,\frac{\|\delta A\|}{\|A\|}. \tag{5.7}$$

Furthermore, these inequalities are optimal. 


Proof. To prove the first result, we observe that $A\,\delta x = \delta b$ implies that 
$\|\delta x\| \le \|A^{-1}\|\,\|\delta b\|$. However, we also have $\|b\| \le \|A\|\,\|x\|$, which yields 

$$\frac{\|\delta x\|}{\|x\|} \le \operatorname{cond}(A)\,\frac{\|\delta b\|}{\|b\|}.$$

This inequality is optimal in the following sense: for every matrix $A$, there 
exist $\delta b$ and $x$ (which depend on $A$) such that 

$$\frac{\|\delta x\|}{\|x\|} = \operatorname{cond}(A)\,\frac{\|\delta b\|}{\|b\|}. \tag{5.8}$$

In fact, according to a property of subordinate matrix norms (cf. Proposition 
3.1.1 in Chapter 3) there exist $x_0 \ne 0$ such that $\|Ax_0\| = \|A\|\,\|x_0\|$ and $x_1 \ne 0$ 
such that $\|A^{-1}x_1\| = \|A^{-1}\|\,\|x_1\|$. For $b = Ax_0$ and $\delta b = x_1$, we have $x = x_0$ 
and $\delta x = A^{-1}x_1$, and equality (5.8) holds. 

To obtain the second result, we observe that 

$$A\,\delta x + \delta A\,(x + \delta x) = 0 \;\Longrightarrow\; \|\delta x\| \le \|A^{-1}\|\,\|\delta A\|\,\|x + \delta x\|,$$

from which we deduce 

$$\frac{\|\delta x\|}{\|x + \delta x\|} \le \operatorname{cond}(A)\,\frac{\|\delta A\|}{\|A\|}.$$

To prove the optimality, we show that for any matrix $A$, there exist a 
perturbation $\delta A$ and a right-hand side $b$ that satisfy the equality. Thanks to 
Proposition 3.1.1 there exists $y \ne 0$ such that $\|A^{-1}y\| = \|A^{-1}\|\,\|y\|$. Let $\varepsilon$ be 
a nonzero scalar. We set $\delta A = \varepsilon I$ and $b = (A + \delta A)y$. We then check that 
$y = x + \delta x$ and $\delta x = -\varepsilon A^{-1}y$, and since $\|\delta A\| = |\varepsilon|$, we infer the desired 
equality. 

In practice, the most frequently used conditionings are 

$$\operatorname{cond}_p(A) = \|A\|_p\,\|A^{-1}\|_p \quad\text{for } p = 1, 2, +\infty,$$

where the matrix norms are subordinate to the vector norms $\|\cdot\|_p$. For 
instance, for the matrix in Example 5.3.1, we have $\operatorname{cond}_\infty(A) \approx 5367$, which 
accounts for the strong amplification of small perturbations of the right-hand 
side on the solution. Let us note at once that the upper bound (5.6), while 
optimal, is in general very pessimistic; see Remark 5.3.3. 

We now establish some properties of the condition number. 


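Proposition 5.3.2. Consider a matrix $A \in \mathcal{M}_n(\mathbb{C})$. 

1. $\operatorname{cond}(A) = \operatorname{cond}(A^{-1})$, and $\operatorname{cond}(\alpha A) = \operatorname{cond}(A)$ for all $\alpha \ne 0$. 
2. For any matrix $A$, 

$$\operatorname{cond}_2(A) = \frac{\mu_1(A)}{\mu_n(A)}, \tag{5.9}$$

where $\mu_n(A)$ and $\mu_1(A)$ are respectively the smallest and the largest 
singular values of $A$. 

3. For a normal matrix $A$, 

$$\operatorname{cond}_2(A) = \frac{|\lambda_{\max}(A)|}{|\lambda_{\min}(A)|} = \varrho(A)\,\varrho(A^{-1}), \tag{5.10}$$

where $|\lambda_{\min}(A)|$ and $|\lambda_{\max}(A)|$ are respectively the moduli of the smallest 
and largest eigenvalues of $A$. 
4. For any unitary matrix $U$, $\operatorname{cond}_2(U) = 1$. 
5. For any unitary matrix $U$, $\operatorname{cond}_2(AU) = \operatorname{cond}_2(UA) = \operatorname{cond}_2(A)$. 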


The proof of this proposition follows directly from the properties of the 
subordinate norm $\|\cdot\|_2$. 


Remark 5.3.2. Equality (5.10) is optimal, in the sense that for any matrix 
norm, we have 

$$\operatorname{cond}(A) = \|A\|\,\|A^{-1}\| \ge \varrho(A)\,\varrho(A^{-1}).$$

In particular, for a normal matrix $A$, we always have $\operatorname{cond}(A) \ge \operatorname{cond}_2(A)$. 

A matrix $A$ is said to be “well conditioned” if for a given norm, $\operatorname{cond}(A) \approx 1$; 
it is said to be “ill conditioned” if $\operatorname{cond}(A) \gg 1$. Unitary matrices are 
very well conditioned, which explains why one should work with such matrices, 
rather than others, whenever possible. 
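The properties above are easy to observe on 2×2 matrices, for which the singular values have a closed form. The sketch below is plain Python; the function name `cond2` and the test matrices are hypothetical choices made for illustration. It relies on the fact that the squared singular values are the eigenvalues of $A^tA$, whose trace is the squared Frobenius norm and whose determinant is $(\det A)^2$.

```python
import math

def cond2(M):
    """2-norm condition number of a nonsingular real 2x2 matrix (list of rows).

    The squared singular values are the roots of s^2 - t*s + det^2 = 0, where t is
    the squared Frobenius norm of M and det is |det(M)|.
    """
    (a, b), (c, d) = M
    t = a * a + b * b + c * c + d * d
    det = abs(a * d - b * c)
    disc = max(t * t - 4.0 * det * det, 0.0)   # guard against tiny negative rounding
    sigma_max_sq = 0.5 * (t + math.sqrt(disc))
    return sigma_max_sq / det                   # mu_1 / mu_2, since mu_1 * mu_2 = |det M|

theta = 0.7
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
S = [[2.0, 1.0], [1.0, 2.0]]                    # symmetric, eigenvalues 1 and 3
scaled = [[-5.0 * v for v in row] for row in S]
nearly_singular = [[1.0, 1.0], [1.0, 1.0001]]

print(cond2(rotation))           # close to 1: orthogonal matrices are perfectly conditioned
print(cond2(S), cond2(scaled))   # same value: cond(alpha A) = cond(A)
print(cond2(nearly_singular))    # large: the matrix is close to a singular one
```

For the symmetric matrix `S` the result is the ratio of its extreme eigenvalues, in agreement with (5.10).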


Since all norms are equivalent in a vector space of finite dimension, the 
condition numbers of matrices in $\mathcal{M}_n(\mathbb{K})$ are equivalent in the following sense. 


Proposition 5.3.3. Conditionings $\operatorname{cond}_1$, $\operatorname{cond}_2$, and $\operatorname{cond}_\infty$ are equivalent: 

$$\begin{aligned}
n^{-1}\operatorname{cond}_2(A) &\le \operatorname{cond}_1(A) \le n\,\operatorname{cond}_2(A), \\
n^{-1}\operatorname{cond}_\infty(A) &\le \operatorname{cond}_2(A) \le n\,\operatorname{cond}_\infty(A), \\
n^{-2}\operatorname{cond}_1(A) &\le \operatorname{cond}_\infty(A) \le n^2\operatorname{cond}_1(A).
\end{aligned}$$

Proof. The inequalities follow from the equivalences between the matrix 
norms $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_\infty$, which in turn follow from the equivalences 
between the corresponding vector norms. 


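Remark 5.3.3. The upper bound (5.6), while optimal, is in general very 
pessimistic, as is shown by the following argument, based on the SVD 
decomposition. Let $A = V\Sigma U^*$ be the SVD decomposition of the nonsingular square 
matrix $A$ with $\Sigma = \operatorname{diag}(\mu_i)$ and $\mu_1 \ge \dots \ge \mu_n > 0$. We expand $b$ 
(respectively $x$) in the basis of columns $v_i$ of $V$ (respectively $u_i$ of $U$): 

$$b = \sum_{i=1}^{n} b_i v_i, \qquad x = \sum_{i=1}^{n} x_i u_i.$$

We consider a perturbation $\delta b$ that we write in the form $\delta b = \varepsilon\|b\|_2 \sum_{i=1}^{n} \delta_i v_i$, 
with $\varepsilon > 0$ and $\sum_i |\delta_i|^2 = 1$, so that $\|\delta b\|_2 = \varepsilon\|b\|_2$. Let us show that 
equality in (5.6) may occur only exceptionally if we use the Euclidean norm. 
Observing that $Au_i = \mu_i v_i$, we have 

$$x_i = \frac{b_i}{\mu_i} \quad\text{and}\quad \delta x_i = \varepsilon\|b\|_2\,\frac{\delta_i}{\mu_i},$$

and since the columns of $U$ and $V$ are orthonormal, the equality in (5.6) occurs 
if and only if 

$$\varepsilon^2\,\frac{\|b\|_2^2}{\|x\|_2^2} \sum_{i=1}^{n} \left|\frac{\delta_i}{\mu_i}\right|^2 = \varepsilon^2\,\frac{\mu_1^2}{\mu_n^2}.$$

Setting $c_i = b_i/\|b\|_2$, this equality becomes 

$$\mu_n^2 \sum_{i=1}^{n} \left|\frac{\delta_i}{\mu_i}\right|^2 = \mu_1^2 \sum_{i=1}^{n} \left|\frac{c_i}{\mu_i}\right|^2,$$

that is, 

$$|\delta_n|^2 + \sum_{i=1}^{n-1} \frac{\mu_n^2}{\mu_i^2}\,|\delta_i|^2 = |c_1|^2 + \sum_{i=2}^{n} \frac{\mu_1^2}{\mu_i^2}\,|c_i|^2.$$

And since $\sum_i |\delta_i|^2 = \sum_i |c_i|^2 = 1$, we have 

$$\sum_{i=1}^{n-1} \left(\frac{\mu_n^2}{\mu_i^2} - 1\right)|\delta_i|^2 = \sum_{i=2}^{n} \left(\frac{\mu_1^2}{\mu_i^2} - 1\right)|c_i|^2.$$

Since the left sum is nonpositive and the right sum is nonnegative, both sums 
are zero. Now, within each sum all the terms have the same sign; hence every 
term of these sums is zero: 

$$\left(\frac{\mu_n^2}{\mu_i^2} - 1\right)|\delta_i|^2 = 0 \quad\text{and}\quad \left(\frac{\mu_1^2}{\mu_i^2} - 1\right)|c_i|^2 = 0 \quad\text{for } 1 \le i \le n.$$

We deduce that if $\mu_i \ne \mu_n$ then $\delta_i = 0$, and if $\mu_i \ne \mu_1$ then $c_i = 0$. In other 
words, the equality in (5.6) may occur only if the right-hand side $b$ belongs to 
the first eigenspace (corresponding to $\mu_1$) of $AA^*$ and if the perturbation $\delta b$ 
belongs to the last eigenspace (corresponding to $\mu_n$) of $AA^*$. This coincidence 
seldom takes place in practice, which accounts for the fact that in general, 
$\|\delta x\|/\|x\|$ is much smaller than its upper bound $\operatorname{cond}(A)\,\|\delta b\|/\|b\|$. See on this 
topic an example in Section 5.3.3 below. 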
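The argument of this remark can be observed on a diagonal matrix, for which the singular vectors are the canonical basis vectors. The sketch below is plain Python; the singular values, the right-hand sides, and the function name `amplification` are illustrative assumptions. It compares the amplification factor $(\|\delta x\|_2/\|x\|_2)/(\|\delta b\|_2/\|b\|_2)$ in the extremal configuration with two other configurations.

```python
import math

# A = diag(mu_1, mu_2) with singular values 100 and 1, so cond_2(A) = 100.
mu = [100.0, 1.0]
cond2_A = mu[0] / mu[1]

def norm2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(t * t for t in v))

def amplification(b, db):
    """Ratio of the relative error in x to the relative error in b, for A x = b."""
    x = [b[i] / mu[i] for i in range(2)]      # solve the diagonal system
    dx = [db[i] / mu[i] for i in range(2)]    # A dx = db
    return (norm2(dx) / norm2(x)) / (norm2(db) / norm2(b))

# b along the first singular direction, db along the last one: the bound is attained.
worst = amplification([1.0, 0.0], [0.0, 1e-6])
# b and db with components along both directions: no amplification here.
generic = amplification([1.0, 1.0], [1e-6, 1e-6])
# Roles exchanged: the relative error is even damped.
best = amplification([0.0, 1.0], [1e-6, 0.0])

print(cond2_A, worst, generic, best)
```

Only the first configuration reaches the factor `cond2_A`; the other two stay far below it, which illustrates how pessimistic (5.6) can be.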


Geometric interpretation of $\operatorname{cond}_2(A)$. We have seen (see Figure 2.2) that 
the image under the matrix $A$ of the unit sphere of $\mathbb{R}^n$ is an ellipsoid, whose 
semiaxes are the singular values $\mu_i$. Proposition 5.3.2 shows that the (2-norm) 
conditioning of a matrix measures the flattening of the ellipsoid. A matrix is 
therefore well conditioned when this ellipsoid is close to a sphere. 


Another interpretation of $\operatorname{cond}_2(A)$. The condition number $\operatorname{cond}_2(A)$ of 
a nonsingular matrix $A$ turns out to be equal to the inverse of the relative 
distance from $A$ to the subset of singular matrices in $\mathcal{M}_n(\mathbb{C})$. Put differently, 
the more ill conditioned a matrix is, the closer it is to being singular (and 
thus difficult to invert numerically). 


Lemma 5.3.1. The condition number $\operatorname{cond}_2(A)$ of a nonsingular matrix $A$ 
is equivalently defined as 

$$\frac{1}{\operatorname{cond}_2(A)} = \inf_{B \in S_n} \frac{\|A - B\|_2}{\|A\|_2},$$

where $S_n$ is the set of singular matrices of $\mathcal{M}_n(\mathbb{C})$. 

Proof. Multiplying the above equality by $\|A\|_2$ we have to prove that 

$$\frac{1}{\|A^{-1}\|_2} = \inf_{B \in S_n} \|A - B\|_2. \tag{5.11}$$

If there were $B \in S_n$ such that $\|A - B\|_2 < 1/\|A^{-1}\|_2$, we would have 

$$\|A^{-1}(A - B)\|_2 \le \|A^{-1}\|_2\,\|A - B\|_2 < 1,$$

and by Proposition 3.3.1 (and Lemma 3.3.1) the matrix $I - A^{-1}(A - B) = A^{-1}B$ 
would be nonsingular, whereas by assumption, $B \in S_n$. Hence, we have 

$$\inf_{B \in S_n} \|A - B\|_2 \ge \frac{1}{\|A^{-1}\|_2}.$$

Now we show that the infimum in (5.11) is attained by a matrix $B_0 \in S_n$ 
satisfying 

$$\|A - B_0\|_2 = \frac{1}{\|A^{-1}\|_2}.$$

By virtue of Proposition 3.1.1 there exists a unit vector $u \in \mathbb{C}^n$, $\|u\|_2 = 1$, 
such that $\|A^{-1}\|_2 = \|A^{-1}u\|_2$. Let us check that the matrix 

$$B_0 = A - \frac{u\,(A^{-1}u)^*}{\|A^{-1}\|_2^2}$$

satisfies the condition. We have 

$$\|A - B_0\|_2 = \frac{1}{\|A^{-1}\|_2^2}\,\|u\,(A^{-1}u)^*\|_2 = \frac{1}{\|A^{-1}\|_2^2} \max_{x \ne 0} \frac{\|u\,(A^{-1}u)^* x\|_2}{\|x\|_2}.$$

Since $u\,(A^{-1}u)^* x = \langle x, A^{-1}u\rangle\, u$ and $\|A^{-1}u\|_2 = \|A^{-1}\|_2$, we deduce 

$$\|A - B_0\|_2 = \frac{1}{\|A^{-1}\|_2^2} \max_{x \ne 0} \frac{|\langle x, A^{-1}u\rangle|}{\|x\|_2} = \frac{\|A^{-1}u\|_2}{\|A^{-1}\|_2^2} = \frac{1}{\|A^{-1}\|_2}.$$

The matrix $B_0$ indeed belongs to $S_n$, i.e., is singular, because $A^{-1}u \ne 0$ and 

$$B_0 A^{-1}u = u - \frac{u\,(A^{-1}u)^* A^{-1}u}{\|A^{-1}\|_2^2} = u - \frac{\langle A^{-1}u, A^{-1}u\rangle\, u}{\|A^{-1}\|_2^2} = 0.$$
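As a sanity check of Lemma 5.3.1, the construction of the proof can be carried out by hand on a diagonal matrix. The matrix below is an illustrative choice and does not come from the text.

```latex
\[
A = \begin{pmatrix} 3 & 0 \\ 0 & \tfrac12 \end{pmatrix}, \qquad
A^{-1} = \begin{pmatrix} \tfrac13 & 0 \\ 0 & 2 \end{pmatrix}, \qquad
\|A\|_2 = 3, \quad \|A^{-1}\|_2 = 2, \quad \operatorname{cond}_2(A) = 6 .
\]
The maximum of $\|A^{-1}u\|_2$ over unit vectors is attained at $u = e_2$, with
$A^{-1}u = 2e_2$, so that
\[
B_0 = A - \frac{u\,(A^{-1}u)^*}{\|A^{-1}\|_2^2}
    = A - \frac{2\,e_2 e_2^*}{4}
    = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix},
\qquad
\frac{\|A - B_0\|_2}{\|A\|_2} = \frac{1/2}{3} = \frac16 = \frac{1}{\operatorname{cond}_2(A)} .
\]
```

The nearest singular matrix is obtained by zeroing the smallest singular value, and its relative distance to $A$ is exactly the inverse of the condition number.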


Remark 5.3.4 (Generalization of the conditioning). If a nonzero matrix $A$ is 
not square or singular, we define its condition number relative to a given norm 
by 

$$\operatorname{cond}(A) = \|A\|\,\|A^\dagger\|,$$

where $A^\dagger$ is the pseudoinverse of the matrix $A$. By Definition 2.7.2 we have 

$$\|A^\dagger\|_2 = \|\tilde\Sigma^\dagger\|_2 = \frac{1}{\mu_p},$$

where $\mu_p$ is the smallest nonzero singular value of the matrix $A$. Denoting by 
$\mu_1(A)$ the largest singular value of $A$, we obtain the following generalization 
of (5.9): 

$$\operatorname{cond}_2(A) = \frac{\mu_1(A)}{\mu_p(A)}.$$


5.3.3 Conditioning of a Finite Difference Matrix 


We return to the differential equation (1.1) and its discretization by finite 
differences (see Section 1.1). When the coefficient c(x) is identically zero on 
the interval [0,1], (1.1) is called the “Laplacian.” In this case, the matrix 
$A_n$, resulting from the discretization by finite difference of the Laplacian, and 
defined by (1.2), reads 

$$A_n = n^2 \begin{pmatrix}
2 & -1 & 0 & \dots & 0 \\
-1 & 2 & -1 & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -1 & 2 & -1 \\
0 & \dots & 0 & -1 & 2
\end{pmatrix}. \tag{5.12}$$

Lemma 5.3.2. For any $n \ge 2$, the linear system $A_n u^{(n)} = b^{(n)}$ has a unique 
solution. 


Proof. It suffices to show that $A_n$ is nonsingular. An easy calculation shows 
that 

$$\langle A_n v, v\rangle = \sum_{i=1}^{n-1} c_i v_i^2 + n^2 \left( v_1^2 + v_{n-1}^2 + \sum_{i=2}^{n-1} (v_i - v_{i-1})^2 \right),$$

which proves that the matrix $A_n$ is positive definite, since $c_i \ge 0$. Hence this 
matrix is nonsingular. 


The particularly simple form of $A_n$ allows us to explicitly determine its 2-norm 
conditioning. Recall that the reliability of the linear system solution associated 
with $A_n$, and thus of the approximation of the solution of the Laplacian, is 
linked to the condition number of $A_n$. Since $A_n$ is symmetric and positive definite 
(see the proof of Lemma 5.3.2), its 2-norm conditioning is given by (5.10), so 
it is equal to the quotient of its extreme eigenvalues. 

We thus compute the eigenvalues and secondarily the eigenvectors of $A_n$. 
Let $p = (p_1, \dots, p_{n-1})^t$ be an eigenvector of $A_n$ corresponding to an eigenvalue 
$\lambda$. Equation $A_n p = \lambda p$ reads, setting $h = 1/n$, 

$$-p_{k-1} + (2 - \lambda h^2)\,p_k - p_{k+1} = 0, \quad 1 \le k \le n-1, \tag{5.13}$$

with $p_0 = p_n = 0$. We look for special solutions of (5.13) in the form 
$p_k = \sin(k\alpha)$, $0 \le k \le n$, where $\alpha$ is a real number to be determined. Relation 
(5.13) implies that 

$$\{2 - 2\cos\alpha - \lambda h^2\}\,\sin(k\alpha) = 0.$$

In particular, for $k = 1$, 

$$\{2 - 2\cos\alpha - \lambda h^2\}\,\sin(\alpha) = 0.$$

Since $\alpha$ is not a multiple of $\pi$, i.e., $\sin(\alpha) \ne 0$ (otherwise $p$ would be zero), we 
infer that 

$$\lambda = \frac{2(1 - \cos\alpha)}{h^2} = \frac{4\sin^2(\alpha/2)}{h^2}.$$

Moreover, the boundary conditions $p_0 = 0$ and $p_n = 0$ imply that $\sin(n\alpha) = 0$. 
Therefore we find $(n-1)$ different possible values of $\alpha$, which we denote by 
$\alpha_\ell = \ell\pi/n$ for $1 \le \ell \le n-1$, and which yield $(n-1)$ distinct eigenvalues of 
$A_n$ (i.e., all the eigenvalues of $A_n$, since it is of order $n-1$): 

$$\lambda_\ell = \frac{4}{h^2}\,\sin^2\left(\frac{\ell\pi}{2n}\right). \tag{5.14}$$


Each eigenvalue $\lambda_\ell$ is associated with an eigenvector $p_\ell$ whose entries are 
$\left(\sin[\ell k\pi/n]\right)_{k=1}^{n-1}$. The $n-1$ eigenvalues of $A_n$ are positive (and distinct), 
$0 < \lambda_1 < \dots < \lambda_{n-1}$, which is consistent with the fact that the matrix $A_n$ is 
positive definite. The condition number of the symmetric matrix $A_n$ is thus 

$$\operatorname{cond}_2(A_n) = \frac{\sin^2\left(\frac{\pi}{2}\,\frac{n-1}{n}\right)}{\sin^2\left(\frac{\pi}{2n}\right)}.$$

When $n$ tends to $+\infty$ (or $h$ tends to 0), we have 

$$\operatorname{cond}_2(A_n) \approx \frac{4n^2}{\pi^2} = \frac{4}{\pi^2 h^2}.$$
We deduce that $\lim_{n\to+\infty} \operatorname{cond}_2(A_n) = +\infty$, so $A_n$ is ill conditioned for $n$ 
large. We come to a dilemma: when $n$ is large, the vector $u^{(n)}$ (discrete solution 
of the linear system) is close to the exact solution of the Laplacian (see 
Theorem 1.1.1). However, the larger $n$ is, the harder it is to accurately determine 
$u^{(n)}$: because of rounding errors, the computer provides an approximate 
solution $\tilde u^{(n)}$ that may be very different from $u^{(n)}$ if we are to believe the bound 
(5.6) of Proposition 5.3.1, which states that the relative error on the solution is 
bounded from above as follows: 

$$\frac{\|\tilde u^{(n)} - u^{(n)}\|_2}{\|u^{(n)}\|_2} \le \operatorname{cond}(A_n)\,\frac{\|\Delta b^{(n)}\|_2}{\|b^{(n)}\|_2} \approx \frac{4}{\pi^2}\,n^2\,\frac{\|\Delta b^{(n)}\|_2}{\|b^{(n)}\|_2},$$

where $\Delta b^{(n)}$ measures the variation of the right-hand side due to rounding 
errors. This upper bound is however (and fortunately) very pessimistic! 
Actually, we shall show for this precise problem, when the boundary conditions 
are $u(0) = u(1) = 0$, that we have 

$$\frac{\|\Delta u^{(n)}\|_2}{\|u^{(n)}\|_2} \le C\,\frac{\|\Delta b^{(n)}\|_2}{\|b^{(n)}\|_2}, \tag{5.15}$$


with a constant $C$ independent of $n$. This outstanding improvement of the 
bound on the relative error is due to the particular form of the right-hand 
side $b^{(n)}$ of the linear system $A_n u^{(n)} = b^{(n)}$. Let us recall that $b^{(n)}$ is obtained 
by discretization of the right-hand side $f(x)$ of (1.1), that is, $b_i^{(n)} = f(x_i)$. 
We have $h\|b^{(n)}\|_2^2 = h\sum_{i=1}^{n-1} f^2(ih)$, and we recognize here a Riemann sum 
discretizing the integral $\int_0^1 f^2(x)\,dx$. Since the function $f$ is continuous, we 
know that 

$$\lim_{n\to+\infty} h\,\|b^{(n)}\|_2^2 = \int_0^1 f^2(x)\,dx.$$

Similarly, 

$$\lim_{n\to+\infty} h\,\|u^{(n)}\|_2^2 = \int_0^1 u^2(x)\,dx.$$


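Recall that $\Delta u^{(n)} = A_n^{-1}\Delta b^{(n)}$, and hence $\|\Delta u^{(n)}\|_2 \le \|A_n^{-1}\|_2\,\|\Delta b^{(n)}\|_2$. Thus 
we deduce the following upper bound for the relative error on the solution: 

$$\frac{\|\Delta u^{(n)}\|_2}{\|u^{(n)}\|_2} \le \|A_n^{-1}\|_2\,\frac{\|\Delta b^{(n)}\|_2}{\|u^{(n)}\|_2} = \frac{1}{\lambda_1}\,\frac{\|b^{(n)}\|_2}{\|u^{(n)}\|_2}\,\frac{\|\Delta b^{(n)}\|_2}{\|b^{(n)}\|_2},$$

since $\|A_n^{-1}\|_2 = 1/\lambda_1 \approx \pi^{-2}$, according to (5.14). Using the above convergence 
of Riemann sums we claim that 

$$\frac{1}{\lambda_1}\,\frac{\|b^{(n)}\|_2}{\|u^{(n)}\|_2} \approx \frac{1}{\pi^2} \left( \frac{\int_0^1 f^2(x)\,dx}{\int_0^1 u^2(x)\,dx} \right)^{1/2},$$

which yields the announced upper bound (5.15) with a constant $C$ independent 
of $n$. 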
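The eigenvalue formula (5.14) and the growth of the condition number can be checked with a few lines of plain Python. This is only a sketch: the function names `laplacian_eigenvalues`, `apply_laplacian`, and `cond2_laplacian` are hypothetical, and the matrix is applied without being stored.

```python
import math

def laplacian_eigenvalues(n):
    """Eigenvalues (5.14) of the (n-1) x (n-1) matrix A_n, with h = 1/n."""
    return [4.0 * n * n * math.sin(l * math.pi / (2 * n)) ** 2 for l in range(1, n)]

def apply_laplacian(n, v):
    """Compute A_n v for A_n = n^2 * tridiag(-1, 2, -1) without storing the matrix."""
    m = n - 1
    out = []
    for k in range(m):
        left = v[k - 1] if k > 0 else 0.0        # boundary value p_0 = 0
        right = v[k + 1] if k < m - 1 else 0.0   # boundary value p_n = 0
        out.append(n * n * (2.0 * v[k] - left - right))
    return out

def cond2_laplacian(n):
    """2-norm condition number of A_n: ratio of its extreme eigenvalues."""
    lam = laplacian_eigenvalues(n)
    return lam[-1] / lam[0]

# Check one eigenpair: p_k = sin(l k pi / n) should satisfy A_n p = lambda_l p.
n, l = 8, 3
p = [math.sin(l * k * math.pi / n) for k in range(1, n)]
lam_l = laplacian_eigenvalues(n)[l - 1]
residual = max(abs(a - lam_l * b) for a, b in zip(apply_laplacian(n, p), p))
print("eigenpair residual:", residual)

# Compare cond_2(A_n) with its asymptotic value 4 n^2 / pi^2.
for size in (10, 100, 1000):
    print(size, cond2_laplacian(size), 4 * size * size / math.pi ** 2)
```

The residual is at rounding level, and the ratio between the two printed columns tends to 1 as the size grows.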


5.3.4 Approximation of the Condition Number 


In general, it is too difficult to compute the condition number of a matrix $A$ 
exactly. For the conditioning in the 1- or $\infty$-norm, we have simple formulas 
at our disposal for the norm of $A$ (see Proposition 3.1.2), but it requires the 
explicit computation of $A^{-1}$ to find its norm, which is far too expensive for 
large matrices. Computing the 2-norm conditioning, given by formula (5.9) or 
(5.10), is also very costly because it requires the extreme singular values or 
eigenvalues of $A$, which is neither easy nor cheap, as we shall see in Chapter 
10. 

Fortunately, in most cases we do not need an exact value of a matrix 
condition number, but only an approximation that will allow us to predict 
beforehand the quality of the expected results (what really matters is the 
order of magnitude of the conditioning more than its precise value). We look 
therefore for an approximation of $\operatorname{cond}_p(A) = \|A^{-1}\|_p\,\|A\|_p$. The case $p = 1$ 
is handled, in Exercise 5.13, as a maximization of a convex function on a 
convex set. The case $p = \infty$ is an easy consequence of the case $p = 1$ since 
$\operatorname{cond}_\infty(A) = \operatorname{cond}_1(A^t)$. In both cases ($p = 1$ and $p = \infty$) the main difficulty 
is the computation of $\|A^{-1}\|_p$. 

We now focus on the case $p = 2$. Consider the SVD decomposition of 
the nonsingular matrix $A \in \mathcal{M}_n(\mathbb{R})$, $A = V\Sigma U^t$, where $\Sigma = \operatorname{diag}(\mu_i)$ with 
$\mu_1 \ge \dots \ge \mu_n > 0$, the singular values of $A$. It furnishes two orthonormal 
bases of $\mathbb{R}^n$: one made up of the columns $u_i$ of the orthogonal matrix $U$, 
the other of the columns $v_i$ of the orthogonal matrix $V$. We expand a vector 
$x \in \mathbb{R}^n$ in the $v_i$ basis, $x = \sum_i x_i v_i$. Since $Au_i = \mu_i v_i$, we have the following 
relations: 

$$\|A\|_2 = \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} = \mu_1 = \|Au_1\|_2 \tag{5.16}$$

and 

$$\|A^{-1}\|_2 = \max_{x \ne 0} \frac{\|A^{-1}x\|_2}{\|x\|_2} = \frac{1}{\mu_n} = \|A^{-1}v_n\|_2. \tag{5.17}$$

We can thus easily determine the norm of $A$ by computing the product $Au_1$, 
and the norm of $A^{-1}$ by solving a linear system $Ax = v_n$. The trouble is that 
in practice, computing $u_1$ and $v_n$ is a very costly problem, similar to finding 
the SVD decomposition of $A$. 

We propose a heuristic evaluation of $\|A\|_2$ and $\|A^{-1}\|_2$ by restricting the 
maximum in (5.16) and (5.17) to the subset $\{x \in \mathbb{R}^n,\ x_i = \pm 1\}$. The scheme 
is the following: we compute an approximation $\alpha$ of $\|A\|_2$, an approximation 
$\beta$ of $\|A^{-1}\|_2$, and deduce an approximation $\alpha\beta$ of $\operatorname{cond}_2(A)$. Note that these 
approximations are actually lower bounds, since we restrict the 
maximization set. Hence, we always get a lower bound for the conditioning: 

$$\operatorname{cond}_2(A) = \|A\|_2\,\|A^{-1}\|_2 \ge \alpha\beta.$$

In practice, we can restrict our attention to triangular matrices. Indeed, we 
shall see in the next chapter that most (if not all) efficient algorithms for 
solving linear systems rely on the factorization of a matrix $A$ as a product of 
two simple matrices (triangular or orthogonal) $A = BC$. The point of such a 
manipulation is that solving a linear system $Ax = b$ is reduced to two easy 
solutions of triangular or orthogonal systems $By = b$ and $Cx = y$ (see Section 
5.1). The following upper bound on conditioning holds: 

$$\operatorname{cond}(A) \le \operatorname{cond}(B)\,\operatorname{cond}(C).$$

The condition number $\operatorname{cond}_2$ of an orthogonal matrix being equal to 1, if we 
content ourselves with an upper bound, it is enough to compute condition 
numbers for triangular matrices only. We shall therefore assume in the sequel 
of this section that the matrix $A$ is (for instance, lower) triangular. 


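Data: $A$. Output: $r \approx \|A\|_2$ 
$x_1 = 1$; $y_1 = a_{1,1}$ 
For $i = 2 \nearrow n$ 
    $s = 0$ 
    For $j = 1 \nearrow i-1$ 
        $s = s + a_{i,j}\,x_j$ 
    End $j$ 
    If $|a_{i,i} + s| > |a_{i,i} - s|$ 
        then $x_i = 1$ 
        otherwise $x_i = -1$ 
    End If 
    $y_i = a_{i,i}\,x_i + s$ 
End $i$ 
$r = \|y\|_2/\sqrt{n}$ 


Algorithm 5.2: Approximation of $\|A\|_2$. 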


Approximation of $\|A\|_2$. When $A$ is lower triangular, the entries of $y = Ax$ 
are deduced (from $i = 1$ to $n$) from those of $x$ by the formulas 

$$y_i = a_{i,i}\,x_i + \sum_{j=1}^{i-1} a_{i,j}\,x_j.$$

A popular computing heuristic runs as follows. We fix $x_1 = 1$ and accordingly 
$y_1 = a_{1,1}$. At each next step $i \ge 2$, we choose $x_i$ equal to either 1 or $-1$, in 
order to maximize the modulus of $y_i$; see Algorithm 5.2. Observe that we do 
not maximize among all vectors $x$ whose entries are equal to $\pm 1$, since at each 
step $i$ we do not change the previous choices $x_k$ with $k < i$ (this is typical of 
so-called greedy algorithms). Since the norm of $x$ is equal to $\sqrt{n}$, we obtain 
the approximation $\|A\|_2 \approx \|y\|_2/\sqrt{n}$. 
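The greedy heuristic just described can be sketched in plain Python as follows. This is one possible transcription of Algorithm 5.2, not a reference implementation: the function name `approx_norm2_lower` and the storage of the matrix as a list of rows are assumptions, and ties between the two candidate signs are resolved in favor of $-1$, as in the strict inequality test of the algorithm.

```python
import math

def approx_norm2_lower(A):
    """Greedy lower estimate of ||A||_2 for a lower triangular matrix A (list of rows).

    The entries x_i = +1 or -1 are chosen one at a time so as to maximize |y_i|,
    where y = A x; the estimate is ||y||_2 / ||x||_2 with ||x||_2 = sqrt(n).
    """
    n = len(A)
    x = [1.0] * n
    y = [0.0] * n
    y[0] = A[0][0]
    for i in range(1, n):
        s = sum(A[i][j] * x[j] for j in range(i))   # contribution of earlier choices
        x[i] = 1.0 if abs(A[i][i] + s) > abs(A[i][i] - s) else -1.0
        y[i] = A[i][i] * x[i] + s
    return math.sqrt(sum(t * t for t in y)) / math.sqrt(n)

L = [[1.0, 0.0],
     [1.0, 1.0]]
estimate = approx_norm2_lower(L)
exact = (1.0 + math.sqrt(5.0)) / 2.0   # ||L||_2 for this particular matrix
print(estimate, exact)                 # the estimate is a lower bound of the exact norm
```

For this 2×2 example the exact norm is known in closed form, which makes it easy to see that the heuristic returns a lower bound that is reasonably close.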


Approximation of $\|A^{-1}\|_2$. Similarly, we seek a vector $x$ whose entries $x_i$ 
are all equal to $\pm 1$ and that heuristically maximizes the norm of $y = A^{-1}x$. 
Since $A$ is lower triangular, $y$ will be computed by the forward substitution 
algorithm previously studied (see Algorithm 5.1). The proposed computing 
heuristic consists in fixing $x_1 = 1$ and choosing each entry $x_i$ for $i \ge 2$ equal 
to 1 or $-1$ so that the modulus of the corresponding entry $y_i$ is maximal. It 
yields the approximation $\|A^{-1}\|_2 \approx \|y\|_2/\sqrt{n}$; see Algorithm 5.3. 


Data: $A$. Output: $r \approx \|A^{-1}\|_2$. 
$y_1 = 1/a_{1,1}$ 
For $i = 2 \nearrow n$ 
    $s = 0$ 
    For $j = 1 \nearrow i-1$ 
        $s = s + a_{i,j}\,y_j$ 
    End $j$ 
    $y_i = -(\operatorname{sign}(s) + s)/a_{i,i}$ 
End $i$ 
$r = \|y\|_2/\sqrt{n}$ 


Algorithm 5.3: Computation of $\|A^{-1}\|_2$. 
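A possible Python transcription of Algorithm 5.3 is sketched below. The function name `approx_norm2_inverse_lower` is hypothetical, and one detail is an assumption on our part: when $s = 0$ the sign is taken equal to $+1$ (a literal `sign(0) = 0` would give $y_i = 0$, which is the worst possible choice).

```python
import math

def approx_norm2_inverse_lower(A):
    """Greedy lower estimate of ||A^{-1}||_2 for a nonsingular lower triangular A.

    Forward substitution gives y_i = (x_i - s) / a_ii with s = sum_{j<i} a_ij y_j;
    choosing x_i = -sign(s) maximizes |y_i|, hence y_i = -(sign(s) + s) / a_ii.
    """
    n = len(A)
    y = [0.0] * n
    y[0] = 1.0 / A[0][0]                      # x_1 = 1
    for i in range(1, n):
        s = sum(A[i][j] * y[j] for j in range(i))
        sign = 1.0 if s >= 0.0 else -1.0      # assumption: sign(0) is taken as +1
        y[i] = -(sign + s) / A[i][i]
    return math.sqrt(sum(t * t for t in y)) / math.sqrt(n)

L = [[1.0, 0.0],
     [1.0, 1.0]]
estimate_inv = approx_norm2_inverse_lower(L)
exact_inv = (1.0 + math.sqrt(5.0)) / 2.0      # ||L^{-1}||_2 for this particular matrix
print(estimate_inv, exact_inv)                # the estimate is a lower bound
```

Multiplying this estimate by an estimate of $\|A\|_2$ obtained in the same greedy fashion gives a low-cost lower bound for $\operatorname{cond}_2(A)$.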


Finally, Algorithm 5.4 computes an approximation, at low cost, of the 2- 
norm condition number of a matrix. We can arguably criticize Algorithm 5.4 
(see the numerical tests in Exercise 5.12) for being based on a local criterion: 
each entry y; is maximized without taking into account the other ones. There 
exist less local variants of Algorithm 5.4 in the sense that they simultaneously 
take into account several entries. 


Data: $A$. Output: $c \approx \operatorname{cond}_2(A)$. 
• compute $r_1$ by Algorithm 5.2 
• compute $r_2$ by Algorithm 5.3 
• set $c = r_1 r_2$. 


Algorithm 5.4: Approximation of $\operatorname{cond}_2(A)$. 


5.3.5 Preconditioning 


Instead of solving a linear system $Ax = b$ with an ill-conditioned matrix $A$, 
it may be more efficient to solve the equivalent linear system $C^{-1}Ax = C^{-1}b$ 
with a nonsingular matrix $C$ that is easily invertible and such that $C^{-1}A$ is 
better conditioned than $A$. The whole difficulty is to find such a matrix $C$, called 
a preconditioner. The best choice would be such that $C^{-1}A$ is close to the 
identity (whose conditioning is minimal, equal to 1), that is, $C$ is close to $A$, 
but computing $A^{-1}$ is at least as difficult as solving the linear system! 

We already know that conditioning is important for the stability and sensi- 
tivity to rounding errors in solving linear systems. We shall see later, in Chap- 
ter 9, that conditioning is also crucial for the convergence of iterative methods 
for solving linear systems (especially the conjugate gradient method). Thus 
it is very important to find good preconditioners. However it is a difficult 
problem for which there is no universal solution. Here are some examples. 


• Diagonal preconditioning. The simplest example of a preconditioner 
is given by the diagonal matrix whose diagonal entries are the inverses 
of the diagonal entries of $A$. For example, we numerically compare the 
conditionings of matrices 

$$A = \begin{pmatrix} 8 & -2 \\ -2 & 50 \end{pmatrix} \quad\text{and}\quad B = D^{-1}A, \quad\text{where } D = \operatorname{diag}(8, 50),$$

for which Matlab gives the approximate values 

$$\operatorname{cond}_2(A) = 6.3371498 \quad\text{and}\quad \operatorname{cond}_2(B) = 1.3370144.$$

Thus, the diagonal preconditioning allows us to reduce the condition 
number, at least for some problems, but certainly not for all of them (think 
about matrices having a constant diagonal entry)! 

• Polynomial preconditioning. The idea is to define $C^{-1} = p(A)$, where 
$p$ is a polynomial such that $\operatorname{cond}(C^{-1}A) \ll \operatorname{cond}(A)$. A good choice of 
$C^{-1} = p(A)$ is to truncate the expansion in power series of $A^{-1}$, 

$$A^{-1} = \left(I - (I - A)\right)^{-1} = I + \sum_{k \ge 1} (I - A)^k,$$

which converges if $\|I - A\| < 1$. In other words, we choose the 
polynomial $p(x) = 1 + \sum_{k=1}^{d} (1 - x)^k$. We suggest that the reader program this 
preconditioner in Exercise 5.15. 

• Right preconditioning. Replacing the system $Ax = b$ by $C^{-1}Ax = C^{-1}b$ 
is called left preconditioning since we multiply the system on its 
left by $C^{-1}$. A symmetric idea is to replace $Ax = b$ by the so-called 
right preconditioned system $AD^{-1}y = b$ with $x = D^{-1}y$ and $D$ (easily) 
invertible. Of course, we can mix these two kinds of preconditioning, 
and solve $C^{-1}AD^{-1}y = C^{-1}b$, then compute $x$ by $x = D^{-1}y$. 
This is interesting if $A$ is symmetric and if we choose $C = D^t$ because 
$C^{-1}AD^{-1}$ is still symmetric. Of course, this preconditioning is efficient 
when $\operatorname{cond}(C^{-1}AD^{-1}) \ll \operatorname{cond}(A)$ and solving $Dx = y$ is easy. 
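The effect of diagonal preconditioning can be reproduced with a short plain-Python sketch. Several things here are assumptions: the off-diagonal entries $-2$ of the test matrix (chosen so that the two condition numbers quoted above are recovered), the function name `cond2`, and the symmetric variant with $C = D^{1/2}$ on both sides, which is only meant to illustrate the mixed left and right preconditioning.

```python
import math

def cond2(M):
    """2-norm condition number of a nonsingular real 2x2 matrix (list of rows)."""
    (a, b), (c, d) = M
    t = a * a + b * b + c * c + d * d           # squared Frobenius norm
    det = abs(a * d - b * c)
    disc = max(t * t - 4.0 * det * det, 0.0)    # guard against tiny negative rounding
    return 0.5 * (t + math.sqrt(disc)) / det    # mu_1^2 / (mu_1 * mu_2)

# Off-diagonal entries are an assumption; the diagonal matches D = diag(8, 50).
A = [[8.0, -2.0], [-2.0, 50.0]]
D = [8.0, 50.0]

# Left preconditioning: B = D^{-1} A (row i is divided by D[i]).
B = [[A[i][j] / D[i] for j in range(2)] for i in range(2)]

# Mixed preconditioning with C = D^{1/2} on both sides: S = C^{-1} A C^{-1} stays symmetric.
S = [[A[i][j] / math.sqrt(D[i] * D[j]) for j in range(2)] for i in range(2)]

print("cond2(A) =", cond2(A))
print("cond2(B) =", cond2(B))
print("cond2(S) =", cond2(S))
```

With these entries the first two values agree with the figures quoted for Matlab, and the symmetric variant gives a slightly smaller condition number while preserving symmetry.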


We shall return to preconditioning in Chapter 9, which features more efficient 
preconditioners. 


5.4 Exercises 


5.1. Floating-point representation, floating-point arithmetic. 
Run the following instructions and comment on the results. 


1. Floating-point accuracy (machine precision). 


a=eps;b=0.5*eps;X=[2, 1;2, 1]; 
A=[2, 1;2, 1+a] ;norm(A-X) 
B=[2, 1;2, 1+b] ;norm(X-B) 


2. Floating-point numbers bounds. 

rM=realmax, 1.0001*rM, rm=realmin, .0001*rm 

3. Infinity and “Not a number.” 

A=[1 2 0 3]; B=1./A, isinf(B), C=A.*B 

4. Singular or not? 

A=[1 1; 1 1+eps];inv(A), rank(A) 
B=[1 1; 1 1+.5*eps];inv(B), rank(B) 


5.2 (*). How to solve a triangular system. 


1. Write a function whose heading is function x=ForwSub(A,b) computing 
by forward substitution (Algorithm 5.1) the solution, if it exists, of the 
system Ax = b, where A is a lower triangular square matrix. 

2. Write similarly a function BackSub(A,b) computing the solution of a sys- 
tem whose matrix is upper triangular. 


5.3 (*). How to store a lower triangular matrix. 


1. Write a program StoreL for storing a lower triangular square matrix. 

2. Write a program StoreLpv for computing the product of a lower triangular 
square matrix and a vector. The matrix is given in the form StoreL. 

3. Write a forward substitution program ForwSubL for computing the solu- 
tion of a lower triangular system with matrix given by StoreL. 


5.4. How to store an upper triangular matrix. In the spirit of the previous 
exercise, write programs StoreU, StoreUpv, and ForwSubU for an upper 
triangular matrix. 


5.5. Write a program StoreLpU computing the product of two matrices, the 
first one being lower triangular and the second upper triangular. The matrices 
are given by the programs StoreL and StoreU. 


5.6. We define a matrix A=[1:5;5:9;10:14]. 


1. Compute a matrix Q whose columns form a basis of the null space of $A^t$. 
2. (a) Consider b=[5; 9; 4] and the vector $x \in \mathbb{R}^5$ defined by the 
instruction x=A\b. Compute $x$, $Ax - b$, and $Q^t b$. 
(b) Same question for b=[1; 1; 1]. Compare both cases. 
(c) Justification. Let $A$ be a real matrix of size $m \times n$. Let $b \in \mathbb{R}^m$. Prove 
the equivalence 

$$b \in \operatorname{Im}(A) \iff Q^t b = 0.$$

(d) Write a function InTheImage(A,b) whose input arguments are a 
matrix A and a vector b and whose output argument is “yes” if $b \in \operatorname{Im} A$ 
and “no” otherwise. Application: 

A=[1 2 3; 4 5 6; 7 8 9], b=[1;1;1], then b=[1;2;1]. 


5.7. The goal of this exercise is to show that using the Cramer formulas is a 
bad idea for solving the linear system $Ax = b$, where $A$ is a nonsingular $n \times n$ 
matrix and $b \in \mathbb{R}^n$. Denoting by $a_1, \dots, a_n$ the columns of $A$, Cramer’s formula 
for the entry $x_i$ of the solution $x$ is $x_i = \det(a_1|\dots|a_{i-1}|b|a_{i+1}|\dots|a_n)/\det A$ 
(see Proposition 5.1.1). Write a function Cramer computing the solution x by 
means of Cramer’s formulas and compare the resulting solution with that 
obtained by the instruction A\b. 

Hint. Use the Matlab function det for computing the determinant of a matrix. 
Application: for n = 20, 40, 60, 80, ... consider the matrix A and vector b 
defined by the instructions 


b=ones(n,1);c=1:n; A=c'*ones(size(c));A=A+A'; 
s=norm(A,'inf'); for i=1:n, A(i,i)=s;end; 


Conclude about the efficiency of this method. 


5.8. Let A and B be two matrices defined by the instructions 
n=10;B=rand(n,n) ;A=[eye(size(B)) B; zeros(size(B)) eye(size(B))]; 


Compute the Frobenius norm of B as well as the condition number of A 
(in the Frobenius norm). Compare the two quantities for various values of n. 
Justify the observations. 


5.9. The goal of this exercise is to empirically determine the asymptotic 
behavior of $\operatorname{cond}_2(H_n)$ as $n$ goes to $\infty$, where $H_n \in \mathcal{M}_n(\mathbb{R})$ is the Hilbert matrix 
of order $n$, defined by its entries $(H_n)_{i,j} = 1/(i+j-1)$. Compute $\operatorname{cond}_2(H_5)$, 
$\operatorname{cond}_2(H_{10})$. What do you notice? For $n$ varying from 2 to 10, plot the curve 
$n \mapsto \ln(\operatorname{cond}_2(H_n))$. Draw conclusions about the experimental asymptotic 
behavior. 




5.10. Write a function Lnorm that computes the approximate 2-norm of a 
lower triangular matrix by Algorithm 5.2. Compare its result with the norm 
computed by Matlab. 


5.11. Write a function LnormAm1 that computes the approximate 2-norm of 
the inverse of a lower triangular matrix by Algorithm 5.3. Compare its result 
with the norm computed by Matlab. 


5.12. Write a function Lcond that computes an approximate 2-norm condi- 
tioning of a lower triangular matrix by Algorithm 5.4. Compare its result with 
the conditioning computed by Matlab. 


5.13 (*). The goal of this exercise is to implement Hager’s algorithm for 
computing an approximate value of $\operatorname{cond}_1(A)$. We denote by $S = \{x \in \mathbb{R}^n,\ \|x\|_1 = 1\}$ 
the unit sphere of $\mathbb{R}^n$ for the 1-norm, and for $x \in \mathbb{R}^n$, 
we set $f(x) = \|A^{-1}x\|_1$ with $A \in \mathcal{M}_n(\mathbb{R})$ a nonsingular square matrix. The 
1-norm conditioning is thus given by 

$$\operatorname{cond}_1(A) = \|A\|_1 \max_{x \in S} f(x).$$

1. Explain how to determine $\|A\|_1$. 

2. Prove that $f$ attains its maximum value at one of the vectors $e_j$ of the 
canonical basis of $\mathbb{R}^n$. 

3. From now on, for a given $x \in \mathbb{R}^n$, we denote by $\tilde x$ the solution of $A\tilde x = x$ 
and by $\bar x$ the solution of $A^t\bar x = s$, where $s$ is the “sign” vector of $\tilde x$, defined 
by $s_i = -1$ if $\tilde x_i < 0$, $s_i = 0$ if $\tilde x_i = 0$, and $s_i = 1$ if $\tilde x_i > 0$. Prove that 
$f(x) = \langle \tilde x, s\rangle$. 

4. Prove that for any $a \in \mathbb{R}^n$, we have $f(x) + \bar x^t(a - x) \le f(a)$. 

5. Show that if $\bar x_j > \langle x, \bar x\rangle$ for some index $j$, then $f(e_j) > f(x)$. 

6. Assume that $\tilde x_j \ne 0$ for all $j$. 

(a) Show that for $y$ close enough to $x$, we have $f(y) = f(x) + s^t A^{-1}(y - x)$. 
(b) Show that if $\|\bar x\|_\infty \le \langle x, \bar x\rangle$, then $x$ is a local maximum of $f$ on the 
unit sphere $S$. 

7. Deduce from the previous questions an algorithm for computing the 
1-norm conditioning of a matrix. 

8. Program this algorithm (function Cond1). Compare its result with the 
conditioning computed by Matlab. 


5.14. We define matrices C, D, and E by 

C=NonsingularMat(n);D=rand(m,n);E=D*inv(C)*D'; 

We also define $(n + m) \times (n + m)$ block matrices A and M: 

A=[C D';D zeros(m,m)];M=[C zeros(n,m);zeros(m,n) E]; 


1. For different values of $n$, compute the spectrum of $M^{-1}A$. What do you 
notice? 

2. What is the point in replacing system $Ax = b$ by the equivalent system 
$M^{-1}Ax = M^{-1}b$? 

3. We now want to give a rigorous explanation of the numerical results of 
the first question. We assume that $A \in \mathcal{M}_{n+m}(\mathbb{R})$ is a nonsingular matrix 
that admits the block structure 

$$A = \begin{pmatrix} C & D^t \\ D & 0 \end{pmatrix},$$

where $C \in \mathcal{M}_n(\mathbb{R})$ and $D \in \mathcal{M}_{m,n}(\mathbb{R})$ are such that $C$ and $DC^{-1}D^t$ are 
nonsingular too. 
(a) Show that the assumption “$A$ is nonsingular” implies $m \le n$. 
(b) Show that for $m = n$, the matrix $D$ is invertible. 

4. From now on, we assume $m < n$. Let $x = (x_1, x_2)^t$ be the solution of the 
system $Ax = b = (b_1, b_2)^t$. The matrix $D$ is not assumed to be invertible, 
so that we cannot first compute $x_1$ by relation $Dx_1 = b_2$, then $x_2$ by 
$Cx_1 + D^t x_2 = b_1$. Therefore, the relation $Dx_1 = b_2$ has to be considered 
as a constraint to be satisfied by the solutions $x_1$, $x_2$ of the system 
$Cx_1 + D^t x_2 = b_1$. We study the preconditioning of the system $Ax = b$ by the 
matrix $M^{-1}$ with 

$$M = \begin{pmatrix} C & 0 \\ 0 & DC^{-1}D^t \end{pmatrix}.$$

(a) Let $\lambda$ be an eigenvalue of $M^{-1}A$ and $(u, v) \in \mathbb{R}^{n+m}$ a corresponding 
eigenvector. Prove that $(\lambda^2 - \lambda - 1)Du = 0$. 

(b) Deduce the spectrum of the matrix $M^{-1}A$. 

(c) Compute the 2-norm conditioning of $M^{-1}A$, assuming that it is a 
symmetric matrix. 


5.15 (*). Program the polynomial preconditioning algorithm presented in 
Section 5.3.5 (function PrecondP). 


5.16 (*). The goal of this exercise is to study the numerical solution of the 
linear system that stems from the finite difference approximation of the Laplace 
equation. According to Section 1.1, the Laplace equation is the following 
second-order differential equation: 

$$\begin{cases} -u''(x) + c(x)u(x) = f(x), \\ u(0) = 0, \quad u(1) = 0, \end{cases} \tag{5.18}$$

where $u : [0,1] \to \mathbb{R}$ denotes the solution (which is assumed to exist and be 
unique) and $f$ and $c$ are given functions. We first study the case $c = 0$. We 
recall that a finite difference discretization at points $x_k = k/n$, $k = 1, \dots, n-1$, 
leads to the linear system 

$$A_n u^{(n)} = b^{(n)}, \tag{5.19}$$

where $A_n \in \mathcal{M}_{n-1}(\mathbb{R})$ is the matrix defined by (5.12), $b^{(n)} \in \mathbb{R}^{n-1}$ is the 
right-hand side with entries $(f(x_i))_{1 \le i \le n-1}$, and $u^{(n)} \in \mathbb{R}^{n-1}$ is the discrete solution 
approximating the exact solution at the points $x_k$, i.e., $u^{(n)} = (u_1, \dots, u_{n-1})^t$ 
with $u_k \approx u(x_k)$. We also recall that the $n-1$ eigenvalues of $A_n$ are given by 

$$\lambda_k = 4n^2 \sin^2\left(\frac{k\pi}{2n}\right), \quad k = 1, \dots, n-1. \tag{5.20}$$

1. Computation of the matrix and right-hand side of (5.19). 

(a) Write a function Laplacian1dD(n) with input argument n and output 
argument $A_n$. 
(b) Write a function InitRHS(n) with input argument n and output 
argument $b^{(n)}$. 

2. Validation. 

(a) Give the exact solution $u^e(x)$ of problem (5.18) when the function $f$ 
is constant, equal to 1. Write a function constructing the vector 
$u_n^e = (u^e(x_1), \dots, u^e(x_{n-1}))^t$. Solve system (5.19) by Matlab. Compare the 
vectors $u^{(n)}$ and $u_n^e$. Explain. 

(b) Convergence of the method. We choose 

$$u^e(x) = (x-1)\sin(10x) \quad\text{and}\quad f(x) = -20\cos(10x) + 100(x-1)\sin(10x).$$

Plot the norm of the error $u^{(n)} - u_n^e$ in terms of $n$. What do you notice? 

3. Eigenvalues and eigenvectors of the matrix $A_n$. 

(a) Compare the eigenvalues of $A_n$ with those of the operator $u \mapsto -u''$ 
endowed with the boundary conditions defined by (5.18). These are 
real numbers $\lambda$ for which we can find nonzero functions $\varphi$ satisfying 
the boundary conditions and such that $-\varphi'' = \lambda\varphi$. 

(b) Numerically compute the eigenvalues of $A_n$ with Matlab and check 
that the results are close to the values given by formula (5.20). 

(c) Plot the 2-norm conditioning of $A_n$ in terms of $n$. Comment. 

4. We now assume that the function $c$ is constant but nonzero. 

(a) Give a formula for the new finite difference matrix $\tilde A_n$. How do its 
eigenvalues depend on the constant $c$? 

(b) From now on we fix $n = 100$, and $c$ is chosen equal to the negative 
of the first eigenvalue of $A_n$. Solve the linear system associated to 
the matrix $\tilde A_n$ and a right-hand side with constant entries equal to 1. 
Check the result. Explain. 


5.17. Reproduce Figures 1.2 and 1.3 of Chapter 1. Recall that these figures 
display the approximation in the least squares sense of the values specified in 
Table 1.1 by a first-degree polynomial and a fourth-degree one respectively. 


5.18. Define $f(x) = \sin(x) - \sin(2x)$ and let X be an array of 100 entries
$(x_i)_{i=1}^{n}$ chosen randomly between 0 and 4 by the function rand. Sort this
array in increasing order using the function sort.


1. Find an approximation of $f$ in the least squares sense by a second-degree
polynomial $p$. Compute the discrete error $\sum_{i=1}^{n} |f(x_i) - p(x_i)|^2$.

2. Find another approximation of $f$ in the least squares sense by a trigonometric
function $q(x) = a + b\cos(x) + c\sin(x)$. Compare $q$ and $p$.


6 


Direct Methods for Linear Systems 


This chapter is devoted to the solution of systems of linear equations of the 
form 


Ax =b, (6.1) 


where A is a nonsingular square matrix with real entries, b is a vector called 
the “right-hand side,” and x is the unknown vector. For simplicity, we invariably
assume that $A \in \mathcal{M}_n(\mathbb{R})$ and $b \in \mathbb{R}^n$. We call a method that allows for
computing the solution x within a finite number of operations (in exact arith- 
metic) a direct method for solving the linear system Ax = b. In this chapter, 
we shall study some direct methods that are much more efficient than the 
Cramer formulas in Chapter 5. The first method is the celebrated Gaussian 
elimination method, which reduces any linear system to a triangular one. The 
other methods rely on the factorization of the matrix A as a product of two 
matrices A = BC. The solution of the system Ax = b is then replaced by the 
solution of two easily invertible systems (the matrices B and C are triangular 
or orthogonal) By = b and Cx = y.


6.1 Gaussian Elimination Method 


The main idea behind this method is to reduce the solution of a general 
linear system to one whose matrix is triangular. As a matter of fact, we have 
seen in Chapter 5 that in the case of an upper triangular system (respectively, 
lower), the solution is straightforward by mere back substitution (respectively, 
forward substitution) in the equations. 

Let us recall the Gaussian elimination method through an example. 


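Example 6.1.1. Consider the following 4 × 4 system to be solved:

$$\begin{aligned}
2x + 4y - 4z + t &= 0, \\
3x + 6y + z - 2t &= -7, \\
-x + y + 2z + 3t &= 4, \\
x + y - 4z + t &= 2,
\end{aligned} \qquad (6.2)$$

which can also be written in matrix form as

$$\begin{pmatrix} 2 & 4 & -4 & 1 \\ 3 & 6 & 1 & -2 \\ -1 & 1 & 2 & 3 \\ 1 & 1 & -4 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ t \end{pmatrix} =
\begin{pmatrix} 0 \\ -7 \\ 4 \\ 2 \end{pmatrix}.$$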


The Gaussian elimination method consists in first removing x from the second,
third, and fourth equations, then y from the third and fourth equations,
and finally z from the fourth equation. Hence, we compute t with the fourth
equation, then z with the third equation, y with the second equation, and
lastly, x with the first equation.

Step 1. We denote by p = 2 the entry 1,1 of the system matrix; we shall call
it the pivot (of the first step). Substituting


• the second equation by itself “minus” the first equation multiplied by $\frac{3}{p} = \frac{3}{2}$,
• the third equation by itself “minus” the first equation multiplied by $\frac{-1}{p} = -\frac{1}{2}$,
• the fourth equation by itself “minus” the first equation multiplied by $\frac{1}{p} = \frac{1}{2}$,


we get the following system:

$$\begin{aligned}
2x + 4y - 4z + t &= 0, \\
7z - 7t/2 &= -7, \\
3y + 7t/2 &= 4, \\
-y - 2z + t/2 &= 2.
\end{aligned}$$


Step 2. This time around, the pivot (entry 2,2 of the new matrix) is zero. We
swap the second row and the third one in order to get a nonzero pivot:

$$\begin{aligned}
2x + 4y - 4z + t &= 0, \\
3y + 7t/2 &= 4, \\
7z - 7t/2 &= -7, \\
-y - 2z + t/2 &= 2.
\end{aligned}$$

The new pivot is p = 3:


• the third equation is unchanged,
• substituting the fourth equation by itself “minus” the second equation
multiplied by $\frac{-1}{p} = -\frac{1}{3}$, we obtain the system

$$\begin{aligned}
2x + 4y - 4z + t &= 0, \\
3y + 7t/2 &= 4, \\
7z - 7t/2 &= -7, \\
-2z + 5t/3 &= 10/3.
\end{aligned}$$


Step 3. Entry 3,3 of the matrix is nonzero; we set p = 7, and we substitute
the fourth equation by itself “minus” the third equation multiplied by $\frac{-2}{p} = -\frac{2}{7}$:

$$\begin{aligned}
2x + 4y - 4z + t &= 0, \\
3y + 7t/2 &= 4, \\
7z - 7t/2 &= -7, \\
2t/3 &= 4/3.
\end{aligned} \qquad (6.3)$$


The last system is triangular. It is easily solved through back substitution; we
obtain t = 2, z = 0, y = −1, and x = 1.
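
The elimination steps of Example 6.1.1 can be replayed on a computer. The following is a minimal Python sketch, not the book's implementation: the function name `gauss_solve` is hypothetical, exact rational arithmetic (`fractions.Fraction`) is used so that the intermediate systems match the hand computation, and rows are swapped only when a pivot is exactly zero, as in the example. It assumes that the matrix is nonsingular.

```python
from fractions import Fraction

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination, swapping rows only when a pivot is zero."""
    n = len(A)
    # Work on an augmented copy [A | b] with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for k in range(n - 1):
        if M[k][k] == 0:
            # Look below the diagonal for a row with a nonzero entry in column k
            # (assumption: A is nonsingular, so such a row exists).
            r = next(i for i in range(k + 1, n) if M[i][k] != 0)
            M[k], M[r] = M[r], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    # Back substitution on the resulting upper triangular system.
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

A = [[2, 4, -4, 1], [3, 6, 1, -2], [-1, 1, 2, 3], [1, 1, -4, 1]]
b = [0, -7, 4, 2]
x = gauss_solve(A, b)
print([str(v) for v in x])  # the unknowns in the order x, y, z, t
```

The right-hand side is carried along as an extra column, so the update of b happens simultaneously with the elimination.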


We now present a matrix formalism that converts any matrix (or system
such as (6.2)) into a triangular matrix (or system such as (6.3)). The idea is to
find a nonsingular matrix M such that the product MA is upper triangular,
then to solve through back substitution the triangular system MAx = Mb.
To implement this idea, the Gaussian elimination method is broken into three
steps:


• elimination: computation of a nonsingular matrix M such that MA = T
is upper triangular;
• right-hand-side update: simultaneous computation of Mb;
• substitution: solving the triangular system Tx = Mb by mere back
substitution.


The existence of such a matrix M is ensured by the following result to 
which we shall give a constructive proof that is nothing but the Gaussian 
elimination method itself. 


Theorem 6.1.1 (Gaussian elimination theorem). Let A be a square ma- 
trix (invertible or not). There exists at least one nonsingular matrix M such 
that the matrix T = MA is upper triangular. 


Proof. The outline of the method is as follows: we build a sequence of matrices
$A^k$, for $1 \le k \le n$, in such a way that we go from $A^1 = A$ to $A^n = T$,
by successive alterations. The entries of the matrix $A^k$ are denoted by

$$A^k = (a^k_{i,j})_{1 \le i,j \le n},$$

and the entry $a^k_{k,k}$ is called the pivot of $A^k$. To pass from $A^k$ to $A^{k+1}$, we
shall first make sure that the pivot $a^k_{k,k}$ is nonzero. If it is not so, we permute
the kth row with another row in order to bring a nonzero element into the
pivot position. The corresponding permutation matrix is denoted by $P^k$ (see
Section 2.2.4 on row permutations). Then, we proceed to the elimination of
all entries of the kth column below the kth row by linear combinations of the
current row with the kth row. Namely, we perform the following steps.

Step 1: We start off with $A^1 = A$. We build a matrix $\tilde{A}^1$ of the form

$$\tilde{A}^1 = P^1 A^1,$$

where $P^1$ is a permutation matrix such that the new pivot $\tilde{a}^1_{1,1}$ is nonzero. If
the pivot $a^1_{1,1}$ is nonzero, we do not permute, i.e., we take $P^1 = I$. If $a^1_{1,1} = 0$
and if there exists an entry in the first column $a^1_{i,1} \ne 0$ (with $2 \le i \le n$),
then we swap the first row with the ith and $P^1$ is equal to the elementary
permutation matrix P(1, i) (we recall that the elementary permutation matrix
P(i, j) is equal to the identity matrix whose rows i and j have been swapped).
Next, we multiply $\tilde{A}^1$ by the matrix $E^1$ defined by

$$E^1 = \begin{pmatrix}
1 & & & 0 \\
-\dfrac{\tilde a^1_{2,1}}{\tilde a^1_{1,1}} & 1 & & \\
\vdots & & \ddots & \\
-\dfrac{\tilde a^1_{n,1}}{\tilde a^1_{1,1}} & 0 & & 1
\end{pmatrix};$$

this removes all the entries of the first column but the first. We set

$$A^2 = E^1 \tilde{A}^1 = \begin{pmatrix}
\tilde a^1_{1,1} & \cdots & \tilde a^1_{1,n} \\
0 & & \\
\vdots & \big(a^2_{i,j}\big) & \\
0 & &
\end{pmatrix}$$

with $a^2_{i,j} = \tilde a^1_{i,j} - \dfrac{\tilde a^1_{i,1}}{\tilde a^1_{1,1}}\, \tilde a^1_{1,j}$ for $2 \le i,j \le n$. The matrix $A^2$ has therefore a first
column with only zeros below its diagonal.

During the permutation step, it may happen that all the elements of the
first column $a^1_{i,1}$ vanish, in which case it is not possible to find a nonzero
pivot. This is not a problem, since this first column already has the desired
property of having zeros below its diagonal! We merely carry on with the
next step by setting $A^2 = A^1$ and $E^1 = P^1 = I$. Such an instance occurs only
if A is singular; otherwise, its first column is inevitably nonzero.

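Step k: We assume that $A^k$ has its first (k − 1) columns with zeros below
its diagonal. We multiply $A^k$ by a permutation matrix $P^k$ to obtain

$$\tilde{A}^k = P^k A^k$$

such that its pivot $\tilde a^k_{k,k}$ is nonzero. If $a^k_{k,k} \ne 0$, then we take $P^k = I$. Otherwise,
there exists $a^k_{i,k} \ne 0$ with $i \ge k + 1$, so we swap the kth row with the ith by
taking $P^k = P(i, k)$. Next, we multiply $\tilde{A}^k$ by a matrix $E^k$ defined by

$$E^k = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & -\dfrac{\tilde a^k_{k+1,k}}{\tilde a^k_{k,k}} & 1 & & \\
 & & \vdots & & \ddots & \\
 & & -\dfrac{\tilde a^k_{n,k}}{\tilde a^k_{k,k}} & & & 1
\end{pmatrix},$$

and we set

$$A^{k+1} = E^k \tilde{A}^k = \begin{pmatrix}
\tilde a^1_{1,1} & \cdots & \cdots & \cdots & \cdots & \tilde a^1_{1,n} \\
0 & \ddots & & & & \vdots \\
\vdots & \ddots & \tilde a^k_{k,k} & \tilde a^k_{k,k+1} & \cdots & \tilde a^k_{k,n} \\
\vdots & & 0 & a^{k+1}_{k+1,k+1} & \cdots & a^{k+1}_{k+1,n} \\
\vdots & & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & a^{k+1}_{n,k+1} & \cdots & a^{k+1}_{n,n}
\end{pmatrix}$$

with $a^{k+1}_{i,j} = \tilde a^k_{i,j} - \dfrac{\tilde a^k_{i,k}}{\tilde a^k_{k,k}}\, \tilde a^k_{k,j}$ for $k + 1 \le i,j \le n$. The matrix $A^{k+1}$ has its first
k columns with only zeros below the diagonal. During the permutation step,
it may happen that all elements of the kth column below the diagonal, $a^k_{i,k}$
with $i \ge k$, are zeros. Then, this kth column already has the desired form
and there is nothing to be done! We carry on with the next step by setting
$A^{k+1} = A^k$ and $E^k = P^k = I$. Such an instance occurs only if A is singular.
Indeed, if A is nonsingular, then so is $A^k$, and its kth column cannot have
zeros from the kth row to the last one, since its determinant would then be
zero.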
After (n − 1) steps, the matrix $A^n$ is upper triangular:

$$A^n = (E^{n-1}P^{n-1} \cdots E^1 P^1) A.$$

We set $M = E^{n-1}P^{n-1} \cdots E^1 P^1$. It is indeed a nonsingular matrix, since

$$\det M = \prod_{i=1}^{n-1} \det E^i \, \det P^i,$$

with $\det P^i = \pm 1$ and $\det E^i = 1$.

We can refresh the right-hand side (that is, compute Mb) sequentially
while computing the matrices $P^k$ and $E^k$. We build a sequence of right-hand
sides $(b^k)_{1 \le k \le n}$ defined by

$$b^1 = b, \qquad b^{k+1} = E^k P^k b^k, \quad \text{for } 1 \le k \le n - 1,$$

which satisfies $b^n = Mb$ in the end.

To solve the linear system Ax = b, it suffices now to solve the system
$A^n x = Mb$, where $A^n$ is an upper triangular matrix. If A is singular, we can
still perform the elimination step, that is, compute $A^n$. However, there is no
guarantee that we can solve the system $A^n x = Mb$, since one of the diagonal
entries of $A^n$ is zero.


Remark 6.1.1. The proof of Theorem 6.1.1 is indeed exactly the Gaussian elim- 
ination method that is used in practice. It is therefore important to emphasize 
some practical details. 


1. We never compute M! We need not multiply matrices $E^i$ and $P^i$ to determine
Mb and $A^n$.

2. If A is singular, one of the diagonal entries of $A^n = T$ is zero. As a result,
we cannot always solve Tx = Mb. Even so, elimination is still possible.

3. At step k, we only modify rows from k + 1 to n between columns k + 1 to
n.

4. A byproduct of Gaussian elimination is the easy computation of the determinant
of A. Actually, we have $\det A = \pm \det T$ depending on the number
of performed permutations.

5. In order to obtain better numerical stability in computer calculations,
we may choose the pivot $\tilde a^k_{k,k}$ in a clever way. To avoid the spreading of
rounding errors, the largest possible pivot (in absolute value) is preferred.
The same selection can be done when the usual pivot $a^k_{k,k}$ is nonzero, i.e.,
we swap rows and/or columns to substitute it with a larger pivot $\tilde a^k_{k,k}$. We
call the process of choosing the largest possible pivot in the kth column 
under the diagonal (as we did in the above proof) partial pivoting. We 
call the process of choosing the largest possible pivot in the lower diagonal 
submatrix of size (n — k + 1) x (n — k + 1) formed by the intersection of 
the last (n — k + 1) rows and the last (n — k + 1) columns (in such a 
case, we swap rows and columns) complete pivoting. These two variants 
of Gaussian elimination are studied in Exercise 6.3. 
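
Items 3 to 5 of the remark can be illustrated together. The sketch below is written in Python under stated assumptions: the function name `det_partial_pivoting` is hypothetical, floating-point arithmetic is used, and an exactly zero column is treated as the signal that A is singular. It performs elimination with partial pivoting, modifies only rows and columns k + 1 to n at step k, and returns det A as plus or minus the product of the diagonal of T, the sign being flipped at each row exchange.

```python
def det_partial_pivoting(A):
    """Return det(A) computed by Gaussian elimination with partial pivoting."""
    n = len(A)
    T = [[float(v) for v in row] for row in A]
    sign = 1.0
    for k in range(n - 1):
        # Partial pivoting: pick the row with the largest |entry| in column k.
        r = max(range(k, n), key=lambda i: abs(T[i][k]))
        if T[r][k] == 0.0:
            return 0.0          # no nonzero pivot available: A is singular
        if r != k:
            T[k], T[r] = T[r], T[k]
            sign = -sign        # each row swap flips the sign of the determinant
        for i in range(k + 1, n):
            factor = T[i][k] / T[k][k]
            for j in range(k + 1, n):   # only rows and columns k+1..n are modified
                T[i][j] -= factor * T[k][j]
    det = sign
    for i in range(n):
        det *= T[i][i]
    return det

A = [[2, 4, -4, 1], [3, 6, 1, -2], [-1, 1, 2, 3], [1, 1, -4, 1]]
print(det_partial_pivoting(A))
```

For the matrix of Example 6.1.1, the hand computation uses one row exchange and ends with the diagonal 2, 3, 7, 2/3, so the printed value should be close to −28 up to rounding.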


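Let us go back to Example 6.1.1 and describe the different steps with the
matrix formalism. The initial system reads as Ax = b with

$$A = \begin{pmatrix} 2 & 4 & -4 & 1 \\ 3 & 6 & 1 & -2 \\ -1 & 1 & 2 & 3 \\ 1 & 1 & -4 & 1 \end{pmatrix}, \qquad
b = \begin{pmatrix} 0 \\ -7 \\ 4 \\ 2 \end{pmatrix}.$$

Set $A_1 = A$ and $b_1 = b$. In the first step, the pivot is nonzero so we take
$P_1 = I$,

$$E_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -\frac{3}{2} & 1 & 0 & 0 \\ \frac{1}{2} & 0 & 1 & 0 \\ -\frac{1}{2} & 0 & 0 & 1 \end{pmatrix}, \qquad
A_2 = E_1 P_1 A_1 = \begin{pmatrix} 2 & 4 & -4 & 1 \\ 0 & 0 & 7 & -\frac{7}{2} \\ 0 & 3 & 0 & \frac{7}{2} \\ 0 & -1 & -2 & \frac{1}{2} \end{pmatrix},$$

and

$$b_2 = E_1 P_1 b_1 = \begin{pmatrix} 0 \\ -7 \\ 4 \\ 2 \end{pmatrix}.$$

In the second step, the pivot is zero, so we take

$$P_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad
E_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & \frac{1}{3} & 0 & 1 \end{pmatrix},$$

and

$$A_3 = E_2 P_2 A_2 = \begin{pmatrix} 2 & 4 & -4 & 1 \\ 0 & 3 & 0 & \frac{7}{2} \\ 0 & 0 & 7 & -\frac{7}{2} \\ 0 & 0 & -2 & \frac{5}{3} \end{pmatrix}, \qquad
b_3 = E_2 P_2 b_2 = \begin{pmatrix} 0 \\ 4 \\ -7 \\ \frac{10}{3} \end{pmatrix}.$$

In the third step, the pivot is nonzero, so we take $P_3 = I$,

$$E_3 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & \frac{2}{7} & 1 \end{pmatrix}, \qquad
A_4 = E_3 P_3 A_3 = \begin{pmatrix} 2 & 4 & -4 & 1 \\ 0 & 3 & 0 & \frac{7}{2} \\ 0 & 0 & 7 & -\frac{7}{2} \\ 0 & 0 & 0 & \frac{2}{3} \end{pmatrix},$$

and

$$b_4 = E_3 P_3 b_3 = \begin{pmatrix} 0 \\ 4 \\ -7 \\ \frac{4}{3} \end{pmatrix}.$$

The solution x is computed by solving the upper triangular system $A_4 x = b_4$,
and we obtain $x = (1, -1, 0, 2)^t$.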


6.2 LU Decomposition Method 


The LU decomposition method consists in factorizing A into a product of two 
triangular matrices 
A= LU, 


where L is lower triangular and U is upper triangular. This decomposition
allows us to reduce the solution of the system Ax = b to solving two triangular
systems Ly = b and Ux = y. It turns out to be nothing else than Gaussian
elimination in the case without pivoting.
The matrices defined by

$$\Delta^k = \begin{pmatrix} a_{1,1} & \cdots & a_{1,k} \\ \vdots & \ddots & \vdots \\ a_{k,1} & \cdots & a_{k,k} \end{pmatrix}$$

are called the diagonal submatrices of order k of $A \in \mathcal{M}_n$. The next result
gives a sufficient condition on the matrix A to have no permutation during
Gaussian elimination.


Theorem 6.2.1 (LU factorization). Let $A = (a_{i,j})_{1 \le i,j \le n}$ be a matrix of
order n all of whose diagonal submatrices of order k are nonsingular. There
exists a unique pair of matrices (L, U), with U upper triangular and L lower
triangular with a unit diagonal (i.e., $l_{i,i} = 1$), such that A = LU.


Remark 6.2.1. The condition stipulated by the theorem is often satisfied in
practice. For example, it holds true if A is positive definite, i.e.,

$$x^t A x > 0, \quad \forall x \ne 0.$$

Indeed, if $\Delta^k$ were singular, then there would exist a vector

$$\begin{pmatrix} x_1 \\ \vdots \\ x_k \end{pmatrix} \ne 0 \quad\text{such that}\quad \Delta^k \begin{pmatrix} x_1 \\ \vdots \\ x_k \end{pmatrix} = 0.$$

Now let $x_0$ be the vector whose k first entries are $(x_1, \ldots, x_k)$, and whose last
n − k entries are zero. We have

$$x_0^t A x_0 = 0 \quad\text{and}\quad x_0 \ne 0,$$

which contradicts the assumption that A is positive definite. Consequently, $\Delta^k$
is nonsingular. Note that the converse is not true: namely, a matrix A such
that all its diagonal submatrices are nonsingular is not necessarily positive
definite, as in the following instance:


Ga) 


Hence the assumption of Theorem 6.2.1 is more general than positive defi- 
niteness. 


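Proof of Theorem 6.2.1. Assume that during Gaussian elimination, there is
no need to permute in order to change the pivot, that is, all natural pivots
$a^k_{k,k}$ are nonzero. Then we have

$$A^n = E^{n-1} \cdots E^1 A,$$

with

$$E^k = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & -l_{k+1,k} & 1 & & \\
 & & \vdots & & \ddots & \\
 & & -l_{n,k} & & & 1
\end{pmatrix},$$

and for $k + 1 \le i \le n$,

$$l_{i,k} = \frac{a^k_{i,k}}{a^k_{k,k}}.$$

Set $U = A^n$ and $L = (E^1)^{-1} \cdots (E^{n-1})^{-1}$, so that we have A = LU. We need
to check that L is indeed lower triangular. A simple computation shows that
$(E^k)^{-1}$ is easily deduced from $E^k$ by changing the sign of the entries below
the diagonal:

$$(E^k)^{-1} = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & +l_{k+1,k} & 1 & & \\
 & & \vdots & & \ddots & \\
 & & +l_{n,k} & & & 1
\end{pmatrix}.$$

Another computation shows that L is lower triangular and that its kth column
is the kth column of $(E^k)^{-1}$:

$$L = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
l_{2,1} & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
l_{n,1} & \cdots & l_{n,n-1} & 1
\end{pmatrix}.$$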


It remains to prove that the pivots do not vanish under the assumption made
on the submatrices $\Delta^k$. We do so by induction. The first pivot $a_{1,1}$ is nonzero,
since it is equal to $\Delta^1$, which is nonsingular. We assume the first k − 1 pivots
to be nonzero. We have to show that the next pivot $a^k_{k,k}$ is nonzero too. Since
the first k − 1 pivots are nonzero, we have computed without permutation
the matrix $A^k$, which is given by $(E^1)^{-1} \cdots (E^{k-1})^{-1} A^k = A$. We write this
equality with block matrices:

$$\begin{pmatrix} L^k_{1,1} & 0 \\ L^k_{2,1} & I \end{pmatrix}
\begin{pmatrix} U^k_{1,1} & A^k_{1,2} \\ A^k_{2,1} & A^k_{2,2} \end{pmatrix} =
\begin{pmatrix} \Delta^k & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix},$$

where $U^k_{1,1}$, $L^k_{1,1}$, and $\Delta^k$ are square blocks of size k, and $A^k_{2,2}$, I, and $A_{2,2}$
square blocks of size n − k. Applying the block matrix product rule yields

$$L^k_{1,1} U^k_{1,1} = \Delta^k,$$

where $U^k_{1,1}$ is an upper triangular matrix, and $L^k_{1,1}$ is a lower triangular matrix
with 1 on the diagonal. We deduce that $U^k_{1,1} = (L^k_{1,1})^{-1} \Delta^k$ is nonsingular as
a product of two nonsingular matrices. Its determinant is therefore nonzero,

$$\det U^k_{1,1} = \prod_{i=1}^{k} a^k_{i,i} \ne 0,$$

which implies that the pivot $a^k_{k,k}$ at the kth step is nonzero.
Finally, let us check the uniqueness of the decomposition. Let there be two
LU factorizations of A:

$$A = L_1 U_1 = L_2 U_2.$$

We infer that

$$L_2^{-1} L_1 = U_2 U_1^{-1},$$

where $L_2^{-1} L_1$ is lower triangular, and $U_2 U_1^{-1}$ is upper triangular. By virtue
of Lemma 2.2.5, the inverse and product of two upper (respectively, lower)
triangular matrices are upper (respectively, lower) triangular too. Hence, both
matrices are diagonal, and since the diagonal of $L_2^{-1} L_1$ consists of 1’s, we have

$$L_2^{-1} L_1 = U_2 U_1^{-1} = I,$$

which proves the uniqueness.


Determinant of a matrix. As for Gaussian elimination, a byproduct of the 
LU factorization is the computation of the determinant of the matrix A, since 
det A = det U. As we shall check in Section 6.2.3, it is a much more efficient 
method than the usual determinant formula (cf. Definition 2.2.8). 


Conditioning of a matrix. Knowing the LU decomposition of A yields an 
easy upper bound on its conditioning, cond(A) < cond(L) cond(U), where 
the conditionings of triangular matrices are computed by Algorithm 5.4. 


Incomplete LU preconditioning. Since the LU factorization provides a
way of computing $A^{-1}$, an approximate LU factorization yields an approximation
of $A^{-1}$ that can be used as a preconditioner. A common approximation
is the so-called incomplete LU factorization, which computes triangular matrices
$\tilde L$ and $\tilde U$ such that $\tilde L \tilde U$ is an approximation of A that is cheap to compute.
To have a fast algorithm for obtaining $\tilde L$ and $\tilde U$ we modify the standard LU
algorithm as follows: the entries $\tilde L_{i,j}$ and $\tilde U_{i,j}$ are computed only if the element
$A_{i,j}$ is nonzero (or larger than a certain threshold). The sparse structure of A
is thus conserved; see Exercise 6.10.
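
To make the idea concrete, here is a minimal Python sketch of an incomplete LU factorization that keeps exactly the nonzero pattern of A (often called ILU(0)). The function name `ilu0`, the dense list-of-lists storage, and the test matrix are assumptions made for illustration; a practical code would use a sparse data structure and possibly a threshold instead of an exact test for zero.

```python
def ilu0(A):
    """Incomplete LU with zero fill-in: update only entries where A is nonzero."""
    n = len(A)
    pattern = [[A[i][j] != 0 for j in range(n)] for i in range(n)]
    W = [[float(v) for v in row] for row in A]
    for k in range(n - 1):
        for i in range(k + 1, n):
            if not pattern[i][k]:
                continue                      # no multiplier is stored outside the pattern
            W[i][k] /= W[k][k]
            for j in range(k + 1, n):
                if pattern[i][j]:             # skip updates that would create fill-in
                    W[i][j] -= W[i][k] * W[k][j]
    # Strict lower part: incomplete L (unit diagonal implied); upper part: incomplete U.
    return W

# Hypothetical "arrow" matrix: the exact LU factorization would fill entries (2,3) and (3,2).
A = [[4, 1, 1],
     [1, 4, 0],
     [1, 0, 4]]
W = ilu0(A)
print(W)
```

On this example the entries (2,3) and (3,2), which are zero in A, remain zero in the computed factors, whereas the exact factorization would produce nonzero values there.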


6.2.1 Practical Computation of the LU Factorization 


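A practical way of computing the LU factorization (if it exists) of a matrix A
is to set

$$L = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
l_{2,1} & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
l_{n,1} & \cdots & l_{n,n-1} & 1
\end{pmatrix}, \qquad
U = \begin{pmatrix}
u_{1,1} & \cdots & \cdots & u_{1,n} \\
0 & u_{2,2} & & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & u_{n,n}
\end{pmatrix},$$

and then identify the product LU with A. Since L is lower triangular and U
upper triangular, for $1 \le i,j \le n$, it entails

$$a_{i,j} = \sum_{k=1}^{n} l_{i,k} u_{k,j} = \sum_{k=1}^{\min(i,j)} l_{i,k} u_{k,j}.$$

A simple algorithm is thus to read in increasing order the columns of A and
to deduce the entries of the columns of L and U.
Column 1. We fix j = 1 and vary i:

$$\begin{aligned}
a_{1,1} = l_{1,1} u_{1,1} &\;\Rightarrow\; u_{1,1} = a_{1,1}; \\
a_{2,1} = l_{2,1} u_{1,1} &\;\Rightarrow\; l_{2,1} = \frac{a_{2,1}}{a_{1,1}}; \\
&\;\;\vdots \\
a_{n,1} = l_{n,1} u_{1,1} &\;\Rightarrow\; l_{n,1} = \frac{a_{n,1}}{a_{1,1}}.
\end{aligned}$$

We have thereby computed all the entries of the first column of L and of the
first column of U.

Column j. We assume that we have computed the first (j − 1) columns of L
and U. Then we read the jth column of A:

$$\begin{aligned}
a_{1,j} = l_{1,1} u_{1,j} &\;\Rightarrow\; u_{1,j} = a_{1,j}; \\
a_{2,j} = l_{2,1} u_{1,j} + l_{2,2} u_{2,j} &\;\Rightarrow\; u_{2,j} = a_{2,j} - l_{2,1} u_{1,j}; \\
&\;\;\vdots \\
a_{j,j} = l_{j,1} u_{1,j} + \cdots + l_{j,j} u_{j,j} &\;\Rightarrow\; u_{j,j} = a_{j,j} - \sum_{k=1}^{j-1} l_{j,k} u_{k,j}; \\
a_{j+1,j} = l_{j+1,1} u_{1,j} + \cdots + l_{j+1,j} u_{j,j} &\;\Rightarrow\; l_{j+1,j} = \frac{a_{j+1,j} - \sum_{k=1}^{j-1} l_{j+1,k} u_{k,j}}{u_{j,j}}; \\
&\;\;\vdots \\
a_{n,j} = l_{n,1} u_{1,j} + \cdots + l_{n,j} u_{j,j} &\;\Rightarrow\; l_{n,j} = \frac{a_{n,j} - \sum_{k=1}^{j-1} l_{n,k} u_{k,j}}{u_{j,j}}.
\end{aligned}$$

We compute in this way the first j entries of the jth column of U and the last
n − j entries of the jth column of L in terms of the first (j − 1) columns. Note
the division by the pivot $u_{j,j}$, which should not be zero.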
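
The column-by-column identification can be transcribed almost literally. The Python sketch below is one possible transcription, with a hypothetical function name `lu_by_columns` and exact rational arithmetic so that the factors can be compared with a hand computation; indices are 0-based in the code whereas the text uses 1-based indices. The test matrix is the matrix of Example 6.1.1 with its second and third rows swapped.

```python
from fractions import Fraction

def lu_by_columns(A):
    """Return (L, U) with A = LU, reading the columns of A in increasing order."""
    n = len(A)
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        # Entries of column j of U located on or above the diagonal.
        for i in range(j + 1):
            U[i][j] = Fraction(A[i][j]) - sum(L[i][k] * U[k][j] for k in range(i))
        if U[j][j] == 0:
            raise ZeroDivisionError("zero pivot: no LU factorization without permutation")
        # Entries of column j of L located below the diagonal.
        for i in range(j + 1, n):
            s = sum(L[i][k] * U[k][j] for k in range(j))
            L[i][j] = (Fraction(A[i][j]) - s) / U[j][j]
    return L, U

A = [[2, 4, -4, 1], [-1, 1, 2, 3], [3, 6, 1, -2], [1, 1, -4, 1]]
L, U = lu_by_columns(A)
for row in U:
    print([str(v) for v in row])
```

Calling the function on the original, unpermuted matrix of Example 6.1.1 raises the zero-pivot error, which mirrors the need for a row exchange at Step 2 of that example.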


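Example 6.2.1. The LU factorization of the matrix in Example 6.1.1, by swapping
its second and third rows, is

$$\begin{pmatrix} 2 & 4 & -4 & 1 \\ -1 & 1 & 2 & 3 \\ 3 & 6 & 1 & -2 \\ 1 & 1 & -4 & 1 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 & 0 \\ -\frac{1}{2} & 1 & 0 & 0 \\ \frac{3}{2} & 0 & 1 & 0 \\ \frac{1}{2} & -\frac{1}{3} & -\frac{2}{7} & 1 \end{pmatrix}
\begin{pmatrix} 2 & 4 & -4 & 1 \\ 0 & 3 & 0 & \frac{7}{2} \\ 0 & 0 & 7 & -\frac{7}{2} \\ 0 & 0 & 0 & \frac{2}{3} \end{pmatrix}.$$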

6.2.2 Numerical Algorithm 


We now write in pseudolanguage the algorithm corresponding to the LU
factorization. We have just seen that the matrix A is scanned column by column.
At the kth step we change its kth column so that the entries below the diagonal
vanish by performing linear combinations of the kth row with every row
from the (k + 1)th to the nth. At the kth step, the first k rows and the first
k − 1 columns of the matrix are no longer modified. We exploit this property
in order to store in the same array, which initially contains the matrix A, the
matrices $A^k$ and $L^k = (E^1)^{-1} \cdots (E^{k-1})^{-1}$. More precisely, the zeros of $A^k$ in
its first (k − 1) columns below the diagonal are replaced by the corresponding
nontrivial entries of $L^k$, which all lie below the diagonal in the first (k − 1)
columns. At the end of the process, the array will contain the two triangular
matrices L and U (i.e., the lower part of L without the diagonal of 1’s and
the upper part of U). We implement this idea in Algorithm 6.1, where the
columns of L are also precomputed before the linear combinations of rows,
which saves some operations.


Data: A. Output: A containing U and L (but its diagonal)

For k = 1 ↗ n − 1                            step k
    For i = k + 1 ↗ n                        row i
        a_{i,k} = a_{i,k} / a_{k,k}          new column of L
        For j = k + 1 ↗ n
            a_{i,j} = a_{i,j} − a_{i,k} a_{k,j}    combination of rows i and k
        End j
    End i
End k

Algorithm 6.1: LU factorization algorithm. 
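
Algorithm 6.1 translates directly into a few lines of Python. The sketch below is an illustration under stated assumptions: the names `lu_in_place` and `solve_with_lu` are hypothetical, the matrix is stored as a list of rows, no pivoting is performed (so the pivots are assumed nonzero), and the small test system is made up for the occasion. The second function shows how the compact array is used to solve Ly = b and then Ux = y.

```python
def lu_in_place(a):
    """Overwrite a with U (upper part) and the strict lower part of L (Algorithm 6.1)."""
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] = a[i][k] / a[k][k]          # new column of L
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]     # combination of rows i and k
    return a

def solve_with_lu(a, b):
    """Solve Ly = b then Ux = y, reading L and U from the compact array a."""
    n = len(a)
    y = list(b)
    for i in range(n):                           # forward substitution (unit diagonal)
        y[i] -= sum(a[i][k] * y[k] for k in range(i))
    x = y[:]
    for i in range(n - 1, -1, -1):               # back substitution
        x[i] = (x[i] - sum(a[i][k] * x[k] for k in range(i + 1, n))) / a[i][i]
    return x

a = lu_in_place([[4.0, 3.0], [6.0, 3.0]])
x = solve_with_lu(a, [10.0, 12.0])
print(a, x)
```

Once the array has been factorized, `solve_with_lu` can be called repeatedly with different right-hand sides at the cost of two triangular solves each.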


6.2.3 Operation Count 


To assess the efficiency of the LU factorization algorithm, we count the number
of operations $N_{op}(n)$ its execution requires (which will be proportional to the
running time on a computer). We do not accurately determine the number of
operations, and we content ourselves with the first-order term of its asymptotic
expansion when the dimension n is large. Moreover, for simplicity, we count
only multiplications and divisions, but not additions, as usual (see Section
4.2).


• LU factorization: the number of operations is

$$N_{op}(n) = \sum_{k=1}^{n-1} \sum_{i=k+1}^{n} \Big( 1 + \sum_{j=k+1}^{n} 1 \Big),$$

which, to first order, yields

$$N_{op}(n) \approx \frac{n^3}{3}.$$

• Back substitution (on the triangular system): the number of operations is

$$N_{op}(n) = \sum_{j=1}^{n} j \approx \frac{n^2}{2}.$$

• Solution of a linear system Ax = b: an LU factorization of A is followed by
two substitutions, Ly = b, then Ux = y. Since $n^2$ is negligible compared
to $n^3$ when n is large, the number of operations is

$$N_{op}(n) \approx \frac{n^3}{3}.$$

The LU method (or Gaussian elimination) is therefore much more efficient
than Cramer’s formulas for solving a linear system. Once the LU factorization
of the matrix is performed, we can easily compute its determinant, as well as
its inverse matrix.


• Computing det A: we compute the determinant of U (that of L is equal
to 1), which requires only the product of the diagonal entries of U (n − 1
multiplications). As a consequence, the number of operations is again

$$N_{op}(n) \approx \frac{n^3}{3}.$$

• Computing $A^{-1}$: the columns of $A^{-1}$, denoted by $x_i$, are the solutions
of the n systems $A x_i = e_i$, where $(e_i)_{1 \le i \le n}$ is the canonical basis of $\mathbb{R}^n$.
A naive count of operations for computing $A^{-1}$ is $4n^3/3$, which is the
sum of $n^3/3$ for a single LU factorization and $n^3$ for solving 2n triangular
systems. We can improve this number by taking into account the fact that
the basis vectors $e_i$ have many zero entries, which decreases the cost of
the forward substitution step with L, because the solution of $L y_i = e_i$ has
its first (i − 1) entries equal to zero. The number of operations becomes

$$N_{op}(n) \approx \frac{n^3}{3} + \sum_{j=1}^{n} \frac{j^2}{2} + n \Big( \frac{n^2}{2} \Big) \approx n^3.$$
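
The first-order estimate for the factorization can be checked by simply counting the multiplications and divisions executed by the three nested loops of the LU algorithm. The following Python sketch does so; the function name `lu_operation_count` is hypothetical. The exact count is $(n^3 - n)/3$, whose leading term is $n^3/3$.

```python
def lu_operation_count(n):
    """Count multiplications and divisions performed by the loops of the LU algorithm."""
    count = 0
    for k in range(1, n):                  # k = 1, ..., n-1
        for i in range(k + 1, n + 1):      # i = k+1, ..., n
            count += 1                     # one division for the multiplier
            count += n - k                 # one multiplication per j = k+1, ..., n
    return count

for n in (10, 100, 400):
    exact = lu_operation_count(n)
    print(n, exact, exact / (n ** 3 / 3))  # the ratio tends to 1 as n grows
```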


6.2.4 The Case of Band Matrices


Band matrices appear in many applications (such as the discretization of
partial differential equations, see Section 1.1) and their special structure allows
us to spare memory and computational cost during the LU factorization.


Definition 6.2.1. A matrix $A \in \mathcal{M}_n(\mathbb{R})$ satisfying $a_{i,j} = 0$ for $|i - j| > p$
with $p \in \mathbb{N}$ is said to be a band matrix of bandwidth 2p + 1.


For instance, a tridiagonal matrix has half-bandwidth p = 1.


Proposition 6.2.1. The LU factorization preserves the band structure of ma- 
trices. 


Proof. Let A be a band matrix of bandwidth 2p + 1 and let A = LU be
its LU decomposition. We want to prove that L and U are band matrices of
bandwidth 2p + 1 too. By definition we have

$$a_{i,j} = \sum_{k=1}^{\min(i,j)} l_{i,k} u_{k,j}.$$

We proceed by induction on i = 1, ..., n. For i = 1, we have


• $a_{1,j} = l_{1,1} u_{1,j} = u_{1,j}$; we infer that $u_{1,j} = 0$ for all $j > p + 1$;
• $a_{j,1} = l_{j,1} u_{1,1}$; in particular, $a_{1,1} = l_{1,1} u_{1,1}$, which entails that $u_{1,1} =
a_{1,1} \ne 0$, so $l_{j,1} = a_{j,1}/a_{1,1}$, which shows that $l_{j,1} = 0$ for all $j > p + 1$.


Assume now that for all i = 1, ..., I − 1, we have

$$j > i + p \;\Longrightarrow\; u_{i,j} = l_{j,i} = 0,$$

and let us prove that this property holds for i = I.

• For $j > I + p$, we have

$$a_{I,j} = \sum_{k=1}^{I} l_{I,k} u_{k,j} = l_{I,I} u_{I,j} + \sum_{k=1}^{I-1} l_{I,k} u_{k,j}.$$

Observing that for all k = 1, ..., I − 1, we have $j > I + p \ge k + 1 + p > k + p$,
we deduce, by application of the induction hypothesis, that $u_{k,j} = 0$ for
$j > k + p$. This implies that $a_{I,j} = u_{I,j}$, hence proving that $u_{I,j} = 0$ for
all $j > I + p$.

• Similarly, for $j > I + p$, we have

$$a_{j,I} = l_{j,I} u_{I,I} + \sum_{k=1}^{I-1} l_{j,k} u_{k,I} = l_{j,I} u_{I,I}.$$

The proof of Theorem 6.2.1 has shown that $u_{I,I} \ne 0$. Thus $l_{j,I} = a_{j,I}/u_{I,I}$,
which implies that $l_{j,I} = 0$ for $j > I + p$.

So the matrices L and U have the same band structure as the matrix A.


Remark 6.2.2. Proposition 6.2.1 does not state that the matrices A, on the 
one hand, and L and U, on the other, have the same “sparse” structure. It 
may happen that the band of A is sparse (contains a lot of zeros), whereas 
the bands of L and U are full. We say that the LU decomposition has filled 
the matrix’s band; see Exercise 6.8. Reducing the bandwidth of a matrix is a 
good way of minimizing the computational and the memory requirements of 
its LU factorization; see Exercise 6.9. 


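Example 6.2.2. To compute the LU factorization of the matrix

$$A = \begin{pmatrix}
1 & 2 & 0 & 0 & 0 \\
2 & 6 & 1 & 0 & 0 \\
0 & 2 & -2 & -1 & 0 \\
0 & 0 & -9 & 1 & 2 \\
0 & 0 & 0 & -4 & 3
\end{pmatrix}$$

we look for the matrices L and U:

$$L = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
a & 1 & 0 & 0 & 0 \\
0 & b & 1 & 0 & 0 \\
0 & 0 & c & 1 & 0 \\
0 & 0 & 0 & d & 1
\end{pmatrix}, \qquad
U = \begin{pmatrix}
e & f & 0 & 0 & 0 \\
0 & g & h & 0 & 0 \\
0 & 0 & i & j & 0 \\
0 & 0 & 0 & k & l \\
0 & 0 & 0 & 0 & m
\end{pmatrix}.$$

By identification of the entries of the product A = LU, we obtain

$$A = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 3 & 1 & 0 \\
0 & 0 & 0 & -1 & 1
\end{pmatrix}
\begin{pmatrix}
1 & 2 & 0 & 0 & 0 \\
0 & 2 & 1 & 0 & 0 \\
0 & 0 & -3 & -1 & 0 \\
0 & 0 & 0 & 4 & 2 \\
0 & 0 & 0 & 0 & 5
\end{pmatrix}.$$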
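
Because the factors inherit the band of A, the loops of the LU algorithm may be restricted to the band. The Python sketch below illustrates this saving; the function name `band_lu` and the dense storage are assumptions made for readability, and no pivoting is performed. The test matrix is the tridiagonal matrix (p = 1) obtained by multiplying the two factors L and U displayed in the example.

```python
from fractions import Fraction

def band_lu(A, p):
    """LU factorization visiting only entries inside the band |i - j| <= p."""
    n = len(A)
    a = [[Fraction(v) for v in row] for row in A]
    for k in range(n - 1):
        last = min(k + p, n - 1)          # beyond this index, row k and column k are zero
        for i in range(k + 1, last + 1):
            a[i][k] /= a[k][k]
            for j in range(k + 1, last + 1):
                a[i][j] -= a[i][k] * a[k][j]
    return a                              # compact form: L below the diagonal, U on and above

A = [[1, 2, 0, 0, 0],
     [2, 6, 1, 0, 0],
     [0, 2, -2, -1, 0],
     [0, 0, -9, 1, 2],
     [0, 0, 0, -4, 3]]
compact = band_lu(A, 1)
for row in compact:
    print([str(v) for v in row])
```

Each step now costs on the order of $p^2$ operations instead of $(n - k)^2$, so the whole factorization costs on the order of $n p^2$ operations.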


Storage of a band matrix. To store a band matrix A of half-bandwidth p,
we use a vector array STOREA of dimension (2p + 1)n. The matrix A is stored
row by row, starting with the first. Let k be the index such that A(i, j) =
STOREA(k). To determine k, we notice that the first entry of row i of A
(i.e., $a_{i,i-p}$) has to be placed in position (i − 1)(2p + 1) + 1, from which we
deduce that the element $a_{i,j}$ has to be stored in STOREA in position k(i, j) =
(i − 1)(2p + 1) + j − i + p + 1 = (2i − 1)p + j. Be aware that some entries of the
vector STOREA are not allocated; however, their number is equal to p(p + 1),
which is negligible; see Exercise 6.5.
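
The index formula can be checked with a short Python sketch. The names `band_index` and `store_band` are hypothetical, and `None` marks the slots of STOREA that are never used; positions are 1-based, as in the text, and converted to 0-based list indices only when accessing the Python list.

```python
def band_index(i, j, p):
    """1-based position of a_{i,j} in STOREA for a band matrix of half-bandwidth p."""
    if abs(i - j) > p:
        raise ValueError("entry lies outside the band")
    return (2 * i - 1) * p + j

def store_band(A, p):
    """Pack the band of A row by row into a list of length (2p+1)n; unused slots stay None."""
    n = len(A)
    storea = [None] * ((2 * p + 1) * n)
    for i in range(1, n + 1):
        for j in range(max(1, i - p), min(n, i + p) + 1):
            storea[band_index(i, j, p) - 1] = A[i - 1][j - 1]
    return storea

# Hypothetical 5 x 5 test matrix whose entry (i, j) is the number 10*i + j.
n, p = 5, 2
A = [[10 * i + j for j in range(1, n + 1)] for i in range(1, n + 1)]
storea = store_band(A, p)
print(storea.count(None))   # number of unused slots, to be compared with p(p + 1)
```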


6.3 Cholesky Method


The Cholesky method applies only to positive definite real symmetric matrices.
Recall that a real symmetric matrix A is positive definite if all its eigenvalues
are positive. The Cholesky method amounts to factorizing $A = BB^t$
with B a lower triangular matrix, so that solving the linear system Ax = b
boils down to two triangular systems By = b and $B^t x = y$.


Theorem 6.3.1 (Cholesky factorization). Let A be a real symmetric positive
definite matrix. There exists a unique real lower triangular matrix B,
having positive diagonal entries, such that

$$A = BB^t.$$


Proof. By the LU factorization theorem, there exists a unique pair of matrices
(L, U) satisfying A = LU with

$$L = \begin{pmatrix} 1 & & & \\ \times & \ddots & & \\ \vdots & \ddots & \ddots & \\ \times & \cdots & \times & 1 \end{pmatrix}
\quad\text{and}\quad
U = \begin{pmatrix} u_{1,1} & \times & \cdots & \times \\ & \ddots & \ddots & \vdots \\ & & \ddots & \times \\ & & & u_{n,n} \end{pmatrix}.$$

We introduce the diagonal matrix $D = \operatorname{diag}(\sqrt{u_{i,i}})$. The square roots of the
diagonal entries of U are well defined, since $u_{i,i}$ is positive, as we now show.
By the same argument concerning the product of block matrices as in the
proof of Theorem 6.2.1, we have

$$\prod_{i=1}^{k} u_{i,i} = \det \Delta^k > 0,$$

with $\Delta^k$ the diagonal submatrix of order k of A. Therefore, by induction, each
$u_{i,i}$ is positive. Next, we set B = LD and $C = D^{-1}U$, so that A = BC. Since
$A = A^t$, we deduce

$$C(B^t)^{-1} = B^{-1}(C^t).$$

By virtue of Lemma 2.2.5, $C(B^t)^{-1}$ is upper triangular, while $B^{-1}C^t$ is lower
triangular. Both of them are thus diagonal. Furthermore, the diagonal entries
of B and C are the same. Therefore, all diagonal entries of $B^{-1}C^t$ are equal
to 1. We infer $C(B^t)^{-1} = B^{-1}C^t = I$, which implies that $C = B^t$.

To prove the uniqueness of the Cholesky factorization, we assume that
there exist two such factorizations:

$$A = B_1 B_1^t = B_2 B_2^t.$$

Then

$$B_2^{-1} B_1 = B_2^t (B_1^t)^{-1},$$

and Lemma 2.2.5 implies again that there exists a diagonal matrix D =
diag$(d_1, \ldots, d_n)$ such that $B_2^{-1} B_1 = D$. We deduce that $B_1 = B_2 D$ and

$$A = B_2 B_2^t = B_2 (D D^t) B_2^t = B_2 D^2 B_2^t.$$

Since $B_2$ is nonsingular, it yields $D^2 = I$, so $d_i = \pm 1$. However, all the diagonal
entries of a Cholesky factorization are positive by definition. Therefore $d_i = 1$,
and $B_1 = B_2$.


6.3.1 Practical Computation of the Cholesky Factorization 


We now give a practical algorithm for computing the Cholesky factor B for a
positive definite symmetric matrix A. This algorithm is different from that of
LU factorization. Take $A = (a_{i,j})_{1 \le i,j \le n}$ and $B = (b_{i,j})_{1 \le i,j \le n}$ with $b_{i,j} = 0$
if i < j. We identify the entries on both sides of the equality $A = BB^t$. For
$1 \le i,j \le n$, we get

$$a_{i,j} = \sum_{k=1}^{n} b_{i,k} b_{j,k} = \sum_{k=1}^{\min(i,j)} b_{i,k} b_{j,k}.$$

By reading, in increasing order, the columns of A (or equivalently its rows,
since A is symmetric) we derive the entries of the columns of B.
Column 1. Fix j = 1 and vary i:

$$\begin{aligned}
a_{1,1} = (b_{1,1})^2 &\;\Rightarrow\; b_{1,1} = \sqrt{a_{1,1}}, \\
a_{2,1} = b_{1,1} b_{2,1} &\;\Rightarrow\; b_{2,1} = \frac{a_{2,1}}{b_{1,1}}, \\
&\;\;\vdots \\
a_{n,1} = b_{1,1} b_{n,1} &\;\Rightarrow\; b_{n,1} = \frac{a_{n,1}}{b_{1,1}}.
\end{aligned}$$

We have thus determined the entries of the first column of B.
Column j. We assume that we have computed the first (j − 1) columns of
B. Then, we read the jth column of A below the diagonal:

$$\begin{aligned}
a_{j,j} = (b_{j,1})^2 + (b_{j,2})^2 + \cdots + (b_{j,j})^2 &\;\Rightarrow\; b_{j,j} = \sqrt{a_{j,j} - \sum_{k=1}^{j-1} (b_{j,k})^2}, \\
a_{j+1,j} = b_{j,1} b_{j+1,1} + b_{j,2} b_{j+1,2} + \cdots + b_{j,j} b_{j+1,j} &\;\Rightarrow\; b_{j+1,j} = \frac{a_{j+1,j} - \sum_{k=1}^{j-1} b_{j,k} b_{j+1,k}}{b_{j,j}}, \\
&\;\;\vdots \\
a_{n,j} = b_{j,1} b_{n,1} + b_{j,2} b_{n,2} + \cdots + b_{j,j} b_{n,j} &\;\Rightarrow\; b_{n,j} = \frac{a_{n,j} - \sum_{k=1}^{j-1} b_{j,k} b_{n,k}}{b_{j,j}}.
\end{aligned}$$

We have thus obtained the jth column of B in terms of its first (j − 1) columns.
Theorem 6.3.1 ensures that when A is symmetric positive definite, the terms
underneath the square roots are positive and the algorithm does not break
down.
Remark 6.3.1. In practice, we don’t need to check whether A is positive definite
before starting the algorithm (we merely verify that it is symmetric).
Actually, if at step j of the Cholesky algorithm we cannot compute a square
root because we find that $b_{j,j}^2 = a_{j,j} - \sum_{k=1}^{j-1} (b_{j,k})^2 < 0$, this proves that A
is not nonnegative. If we find that $b_{j,j}^2 = 0$, which prevents us from computing
the entries $b_{i,j}$ for $i > j$, then A is not positive definite. However, if
the Cholesky algorithm terminates “without trouble,” we deduce that A is
positive definite.
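
The remark suggests using the factorization itself as a test of positive definiteness. A minimal Python sketch of this idea follows, assuming a symmetric input; the function name `cholesky_or_none` is hypothetical, and in floating-point arithmetic a matrix that is very close to singular may be classified either way because of rounding.

```python
import math

def cholesky_or_none(A):
    """Return the lower triangular B with A = B B^t, or None if some b_jj^2 <= 0 appears."""
    n = len(A)
    B = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(B[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            return None          # the algorithm breaks down: A is not positive definite
        B[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(B[j][k] * B[i][k] for k in range(j))
            B[i][j] = (A[i][j] - s) / B[j][j]
    return B

print(cholesky_or_none([[4, 2], [2, 3]]))   # a factor is returned
print(cholesky_or_none([[1, 2], [2, 1]]))   # None: this matrix has a negative eigenvalue
```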


Determinant of a matrix. The Cholesky factorization is also used for computing
the determinant of a matrix A, since $\det A = (\det B)^2$.


Conditioning of a matrix. Knowing the Cholesky decomposition of A allows
us to easily compute the 2-norm conditioning of A, which is $\mathrm{cond}_2(A) =
\mathrm{cond}_2(BB^t) = \mathrm{cond}_2(B)^2$, since for any square matrix X, $\|XX^t\|_2 = \|X^tX\|_2 =
\|X\|_2^2$. The conditioning of the triangular matrix B is computed by Algorithm
5.4.


Incomplete Cholesky preconditioning. By extending the idea of the in- 
complete LU factorization, we obtain an incomplete Cholesky factorization 
that can be used as a preconditioner. 


6.3.2 Numerical Algorithm 


The Cholesky algorithm is written in a compact form using the array that 
initially contained A and is progressively filled by B. At each step j, only 
the jth column of this array is modified: it contains initially the jth column, 
which is overridden by the jth column of B. Note that it suffices to store the 
lower half of A, since A is symmetric. 


Data: A. Output: A containing B in its lower triangular part
For j = 1 ↗ n
    For k = 1 ↗ j − 1
        a_{j,j} = a_{j,j} − (a_{j,k})^2
    End k
    a_{j,j} = sqrt(a_{j,j})
    For i = j + 1 ↗ n
        For k = 1 ↗ j − 1
            a_{i,j} = a_{i,j} − a_{j,k} a_{i,k}
        End k
        a_{i,j} = a_{i,j} / a_{j,j}
    End i
End j


Algorithm 6.2: Cholesky Algorithm. 
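
A direct Python transcription of Algorithm 6.2 might look as follows. This is only a sketch: the function name `cholesky_in_place` is hypothetical, the input is assumed to be symmetric positive definite (no breakdown test is made), and the 3 × 3 test matrix was chosen for illustration so that its factor has integer entries.

```python
import math

def cholesky_in_place(a):
    """Overwrite the lower triangle of a with B (Algorithm 6.2); the upper part is untouched."""
    n = len(a)
    for j in range(n):
        for k in range(j):
            a[j][j] -= a[j][k] ** 2
        a[j][j] = math.sqrt(a[j][j])
        for i in range(j + 1, n):
            for k in range(j):
                a[i][j] -= a[j][k] * a[i][k]
            a[i][j] /= a[j][j]
    return a

a = [[4.0, 2.0, 2.0],
     [2.0, 10.0, 7.0],
     [2.0, 7.0, 21.0]]
cholesky_in_place(a)
lower = [[a[i][j] if j <= i else 0.0 for j in range(3)] for i in range(3)]
print(lower)
```

Only the entries on and below the diagonal are read and written, which is why storing the lower half of A suffices.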
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6.3.3 Operation Count 


To assess the efficiency of the Cholesky method, we count the number of oper- 
ations its execution requires (which will be proportional to its running time on 
a computer). Once again, we content ourselves with the asymptotic first-order 
term when the dimension n is large. We take into account only multiplications 
and divisions. Although taking a square root is a more expensive operation 
than a multiplication, we neglect them because their number is n, which is 
negligible in comparison to n°. 


• Cholesky factorization: The number of operations is

    $N_{op}(n) = \sum_{j=1}^{n} \Big( (j-1) + \sum_{i=j+1}^{n} j \Big) \approx \frac{n^3}{6}.$


• Substitution: a forward and a back substitution are performed on the
  triangular systems associated with B and B*. The number of operations is
  $N_{op}(n) \approx n^2$, which is thus negligible compared to the $n^3/6$ of the
  factorization.


The Cholesky method is thus approximately twice as fast as the Gauss method 
for a positive definite symmetric matrix. 
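
The asymptotic count can be checked numerically. The following Python sketch (the function name `cholesky_op_count` is hypothetical) simply counts the multiplications and divisions executed by the loops of Algorithm 6.2 and compares the total with $n^3/6$.

```python
def cholesky_op_count(n):
    """Count multiplications and divisions performed by the Cholesky algorithm."""
    ops = 0
    for j in range(1, n + 1):
        ops += j - 1                  # squares subtracted from the diagonal entry
        for i in range(j + 1, n + 1):
            ops += (j - 1) + 1        # j-1 products, then one division
    return ops

for n in (10, 100, 1000):
    count = cholesky_op_count(n)
    print(n, count, count / (n ** 3 / 6))
```

The printed ratio approaches 1 as n grows, in agreement with the first-order term $n^3/6$.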


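Example 6.3.1. Let us compute the Cholesky factorization of

    $A = \begin{pmatrix} 1 & 2 & 1 & 2 \\ 2 & 13 & 2 & 4 \\ 1 & 2 & 2 & 3 \\ 2 & 4 & 3 & 9 \end{pmatrix}.$

We look for a lower triangular matrix B of the form

    $B = \begin{pmatrix} b_{1,1} & 0 & 0 & 0 \\ b_{2,1} & b_{2,2} & 0 & 0 \\ b_{3,1} & b_{3,2} & b_{3,3} & 0 \\ b_{4,1} & b_{4,2} & b_{4,3} & b_{4,4} \end{pmatrix}.$

The algorithm of Section 6.3.1 yields

▷ computing the first column of B:
  • $b_{1,1}^2 = 1 \Longrightarrow b_{1,1} = 1$,
  • $b_{1,1}b_{2,1} = 2 \Longrightarrow b_{2,1} = 2$,
  • $b_{1,1}b_{3,1} = 1 \Longrightarrow b_{3,1} = 1$,
  • $b_{1,1}b_{4,1} = 2 \Longrightarrow b_{4,1} = 2$;

▷ computing the second column of B:
  • $b_{2,1}^2 + b_{2,2}^2 = 13 \Longrightarrow b_{2,2} = 3$,
  • $b_{3,1}b_{2,1} + b_{3,2}b_{2,2} = 2 \Longrightarrow b_{3,2} = 0$,
  • $b_{4,1}b_{2,1} + b_{4,2}b_{2,2} = 4 \Longrightarrow b_{4,2} = 0$;

▷ computing the third column of B:
  • $b_{3,1}^2 + b_{3,2}^2 + b_{3,3}^2 = 2 \Longrightarrow b_{3,3} = 1$,
  • $b_{4,1}b_{3,1} + b_{4,2}b_{3,2} + b_{4,3}b_{3,3} = 3 \Longrightarrow b_{4,3} = 1$;

▷ computing the fourth column of B:
  • $b_{4,1}^2 + b_{4,2}^2 + b_{4,3}^2 + b_{4,4}^2 = 9 \Longrightarrow b_{4,4} = 2$.

Eventually we obtain

    $B = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 2 & 3 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 2 & 0 & 1 & 2 \end{pmatrix}.$

We solve the linear system Ax = b with $b = (4, 8, 5, 17)^t$ by the Cholesky
method. We first determine the solution y of By = b, next the solution x of
$B^tx = y$. We obtain $y = (4, 0, 1, 4)^t$ and $x = (1, 0, -1, 2)^t$.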
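
The two triangular solves of this example can be reproduced with a few lines of Python. This is a sketch only: the helper names `forward_substitution` and `back_substitution_transpose` are chosen for this illustration and are not the routines ForwSub and BackSub referred to in the exercises.

```python
def forward_substitution(B, b):
    """Solve B y = b for a lower triangular matrix B."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(B[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - s) / B[i][i]
    return y

def back_substitution_transpose(B, y):
    """Solve B^T x = y, reading the upper triangular B^T from the lower triangular B."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(B[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / B[i][i]
    return x

# Cholesky factor and right-hand side of Example 6.3.1
B = [[1.0, 0.0, 0.0, 0.0],
     [2.0, 3.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [2.0, 0.0, 1.0, 2.0]]
b = [4.0, 8.0, 5.0, 17.0]
y = forward_substitution(B, b)
x = back_substitution_transpose(B, y)
print(y, x)
```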


By a simple adaptation of the proof of Proposition 6.2.1 we can prove the 
following result. 


Proposition 6.3.1. The Cholesky factorization preserves the band structure 
of matrices. 


Remark 6.3.2. Computing the inverse $A^{-1}$ of a symmetric matrix A by the
Cholesky method costs $n^3/2$ operations (which improves by a factor of 2 the
previous result in Section 6.2.3). We first pay $n^3/6$ for the Cholesky
factorization, then compute the columns $x_i$ of $A^{-1}$ by solving $Ax_i = e_i$. This is
done in two steps: first solve $By_i = e_i$, then solve $B^*x_i = y_i$. Because of the
zeros in $e_i$, solving $By_i = e_i$ costs $(n-i)^2/2$, while because of the symmetry
of $A^{-1}$, we need to compute only the $(n-i+1)$ last components of $x_i$, which
costs again of the order of $(n-i)^2/2$. The total cost of solving the triangular
linear systems is thus of order $n^3/3$. The addition of $n^3/6$ and $n^3/3$ yields the
result $N_{op}(n) \approx n^3/2$.
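
For the reader who wishes to check the summation, the two triangular solves for column i cost about $(n-i)^2/2$ each, so that, keeping only the leading term,

```latex
\sum_{i=1}^{n}\left(\frac{(n-i)^2}{2}+\frac{(n-i)^2}{2}\right)
  = \sum_{m=0}^{n-1} m^2
  = \frac{(n-1)\,n\,(2n-1)}{6}
  \approx \frac{n^3}{3},
\qquad
\frac{n^3}{6}+\frac{n^3}{3}=\frac{n^3}{2}.
```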


6.4 QR Factorization Method 


The main idea of the QR factorization is again to reduce a linear system to 
a triangular one. However, the matrix is not factorized as the product of two 
triangular matrices (as previously), but as the product of an upper triangular 
matrix R and an orthogonal (unitary) matrix Q, which, by definition, is easy
to invert, since $Q^{-1} = Q^*$.
In order to solve the linear system Ax = b we proceed in three steps. 
(i) Factorization: finding an orthogonal matrix Q such that Q*A = R is 
upper triangular. 

(ii) Updating the right-hand side: computing Q*b. 

(iii) Back substitution: solving the triangular system Rx = Q*b. 
If A is nonsingular, the existence of such an orthogonal matrix Q is
guaranteed by the following result, for which we give a constructive proof by
the Gram-Schmidt orthonormalization process.


Theorem 6.4.1 (QR factorization). Let A be a real nonsingular matrix.
There exists a unique pair (Q, R), where Q is an orthogonal matrix and R is
an upper triangular matrix, whose diagonal entries are positive, satisfying


A=QR. 


Remark 6.4.1. This factorization will be generalized to rectangular and sin- 
gular square matrices in Section 7.3.3. 


Proof of Theorem 6.4.1. Let $a_1, \dots, a_n$ be the column vectors of A. Since
they form a basis of $\mathbb{R}^n$ (because A is nonsingular), we apply to them the
Gram-Schmidt orthonormalization process, which produces an orthonormal
basis $q_1, \dots, q_n$ defined by

    $q_i = \frac{a_i - \sum_{k=1}^{i-1} \langle q_k, a_i \rangle q_k}{\left\| a_i - \sum_{k=1}^{i-1} \langle q_k, a_i \rangle q_k \right\|}, \quad 1 \le i \le n.$

We deduce

    $a_i = \sum_{k=1}^{i} r_{ki} q_k$, with $r_{ki} = \langle q_k, a_i \rangle$ for $1 \le k \le i-1$,    (6.4)

and

    $r_{ii} = \left\| a_i - \sum_{k=1}^{i-1} \langle q_k, a_i \rangle q_k \right\| > 0.$

We set $r_{ki} = 0$ if k > i, and we denote by R the upper triangular matrix with
entries $(r_{ki})$. We denote by Q the matrix with columns $q_1, \dots, q_n$, which is
precisely an orthogonal matrix. With this notation, (6.4) is equivalent to

    $A = QR.$


To prove the uniqueness of this factorization, we assume that there exist two 
factorizations 
    $A = Q_1R_1 = Q_2R_2.$

Then $Q_2^*Q_1 = R_2R_1^{-1}$ is upper triangular with positive diagonal entries as a
product of two upper triangular matrices (see Lemma 2.2.5). Let $T = R_2R_1^{-1}$.
We have

    $TT^* = (Q_2^*Q_1)(Q_2^*Q_1)^* = I.$


Hence T is a Cholesky factorization of the identity, and since it is unique, we 
necessarily have T = I. 
Remark 6.4.2. The above proof of uniqueness of the QR factorization relies
crucially on the positivity assumption of the diagonal entries of R. Let us
investigate the case that the diagonal entries of R have no specific sign (this
turns out to be useful in the proof of Theorem 10.6.1). Consider two QR
factorizations of the same nonsingular matrix

    $A = Q_1R_1 = Q_2R_2.$

The upper triangular matrix $R_2R_1^{-1}$ is thus equal to the orthogonal one $Q_2^*Q_1$,
hence it is diagonal (see the proof of Theorem 2.5.1): there exists a diagonal
matrix D such that $R_2R_1^{-1} = Q_2^*Q_1 = D$. That is to say, $R_2 = DR_1$ and
$Q_1 = Q_2D$. The last equality implies $|D_{i,i}| = 1$. In other words, the QR
factorization of a real nonsingular matrix is always unique up to the
multiplication of each column k of Q and each row k of R by the factor
$r_k = \pm 1$. In the complex case, the multiplication factor is a complex number
of unit modulus, $e^{is}$, where s is a real number.


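Example 6.4.1. The QR factorization of the matrix

    $A = \begin{pmatrix} 1 & -1 & 2 \\ -1 & 1 & 0 \\ 0 & -2 & 1 \end{pmatrix}$

is

    $Q = \begin{pmatrix} 1/\sqrt{2} & 0 & 1/\sqrt{2} \\ -1/\sqrt{2} & 0 & 1/\sqrt{2} \\ 0 & -1 & 0 \end{pmatrix}, \quad R = \begin{pmatrix} \sqrt{2} & -\sqrt{2} & \sqrt{2} \\ 0 & 2 & -1 \\ 0 & 0 & \sqrt{2} \end{pmatrix}.$

To determine the solution of Ax = b with $b = (-3, 1, 5)^t$, we first compute
$y = Q^tb = \frac{1}{\sqrt{2}}(-4, -5\sqrt{2}, -2)^t$, then solve Rx = y to obtain
$x = (-4, -3, -1)^t$.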
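
The constructive proof of Theorem 6.4.1 translates directly into code. The sketch below is a plain Python transcription of the Gram-Schmidt formulas for a square nonsingular matrix, followed by the three solution steps (factorization, update of the right-hand side, back substitution) applied to the data of Example 6.4.1. The names `gram_schmidt_qr` and `solve_qr` are assumptions made for this illustration.

```python
import math

def gram_schmidt_qr(A):
    """Return (Q, R) with A = Q R, built column by column by classical Gram-Schmidt."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]    # columns a_1, ..., a_n
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(cols):
        v = a[:]
        for k, q in enumerate(q_cols):
            R[k][i] = sum(qe * ae for qe, ae in zip(q, a))    # r_ki = <q_k, a_i>
            v = [ve - R[k][i] * qe for ve, qe in zip(v, q)]
        R[i][i] = math.sqrt(sum(ve * ve for ve in v))         # positive if A is nonsingular
        q_cols.append([ve / R[i][i] for ve in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R

def solve_qr(Q, R, b):
    """Solve A x = b from A = Q R: compute y = Q^T b, then back-substitute R x = y."""
    n = len(b)
    y = [sum(Q[i][k] * b[i] for i in range(n)) for k in range(n)]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(R[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / R[i][i]
    return x

A = [[1.0, -1.0, 2.0],
     [-1.0, 1.0, 0.0],
     [0.0, -2.0, 1.0]]
b = [-3.0, 1.0, 5.0]
Q, R = gram_schmidt_qr(A)
x = solve_qr(Q, R, b)
print(x)
```

Up to rounding errors, the computed x agrees with the solution (−4, −3, −1) given in the example.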


6.4.1 Operation Count 


We assess the efficiency of the Gram-Schmidt algorithm for the QR method 
by counting the number of multiplications that are necessary to its execution. 
The number of square roots is n, which is negligible in this operation count. 


• Gram-Schmidt factorization: the number of operations is

    $N_{op}(n) = \sum_{i=1}^{n} \big( (i-1)(2n) + (n+1) \big) \approx n^3.$

• Updating the right-hand side: computing the matrix-vector product $Q^*b$
  requires $N_{op}(n) = n^2$.
• Back substitution: solving the triangular system associated with R
  requires $N_{op}(n) \approx n^2/2$.
The Gram-Schmidt algorithm for the QR method is thus three times slower
than Gaussian elimination. It is therefore not used in practice to solve linear 
systems. Nonetheless, the QR method may be generalized, and it is useful for 
solving least squares fitting problems (see Chapter 7). 


Remark 6.4.3. In numerical practice the Gram-Schmidt procedure is not used
to find the QR factorization of a matrix because it is an unstable algorithm 
(rounding errors prevent the matrix Q from being exactly orthogonal). We 
shall see in the next chapter a better algorithm, known as the Householder 
algorithm, to compute the QR factorization of a matrix. 
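
The loss of orthogonality can be observed on a small example. The Python sketch below applies the classical Gram-Schmidt formulas to three nearly parallel vectors of $\mathbb{R}^4$; this test set (with the parameter `eps`) is a classical illustration chosen here as an assumption, it does not come from the text. In exact arithmetic the three computed vectors would be mutually orthogonal.

```python
import math

def classical_gram_schmidt(columns):
    """Orthonormalize a list of vectors with the classical Gram-Schmidt formulas."""
    qs = []
    for a in columns:
        v = a[:]
        for q in qs:
            r = sum(qe * ae for qe, ae in zip(q, a))      # coefficient <q_k, a_i>
            v = [ve - r * qe for ve, qe in zip(v, q)]
        norm = math.sqrt(sum(ve * ve for ve in v))
        qs.append([ve / norm for ve in v])
    return qs

eps = 1e-8   # eps**2 is below the double-precision rounding unit
columns = [[1.0, eps, 0.0, 0.0],
           [1.0, 0.0, eps, 0.0],
           [1.0, 0.0, 0.0, eps]]
q1, q2, q3 = classical_gram_schmidt(columns)
defect = abs(sum(s * t for s, t in zip(q2, q3)))   # would be 0 in exact arithmetic
print(defect)
```

In double precision the scalar product of the last two vectors is far from zero, so the computed Q is not orthogonal at all on this example.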


Conditioning of a matrix. If one knows the QR factorization of a matrix A,
its 2-norm conditioning is easy to compute, since
$\mathrm{cond}_2(A) = \mathrm{cond}_2(QR) = \mathrm{cond}_2(R)$ because Q is unitary.


6.5 Exercises 


6.1. We define a matrix A=[1 2 3; 4 5 6; 7 8 9]. Compute its determi- 
nant using the Matlab function det. Explain why the result is not an integer. 


6.2. The goal of this exercise is to compare the performances of the LU and 
Cholesky methods. 


1. Write a function LUfacto returning the matrices L and U determined via 
Algorithm 6.1. If the algorithm cannot be executed (division by 0), return 
an error message. 

2. Write a function Cholesky returning the matrix B computed by Algo- 
rithm 6.2. If the algorithm cannot be executed (nonsymmetric matrix, 
division by 0, negative square root), return an error message. Compare 
with the Matlab function chol. 

3. For n = 10,20,...,100, we define a matrix A=MatSdp(n) (see Exercise 
2.20) and a vector b=ones(n, 1). Compare: 

e On the one hand, the running time for computing the matrices L and 
U given by the function LUFacto, then the solution x of the system 
Ax = b. Use the functions BackSub and ForwSub defined in Exercise 
5.2. 

e On the other hand, the running time for computing the matrix B given 
by the function Cholesky, then the solution x of the system Ax = b. 
Use the functions BackSub and ForwSub. 

Plot on the same graph the curves representing the running times in terms 

of n. Comment. 


6.3 (*). The goal of this exercise is to program the following variants of the
Gauss algorithm: 
• the Gauss algorithm with partial pivoting (by row), which consists, at each
  step k of the Gauss elimination, in determining an index $i_0$ ($k \le i_0 \le n$)
  such that

    $|a_{i_0,k}| = \max_{k \le i \le n} |a_{i,k}|,$    (6.5)

  then swapping rows k and $i_0$,
• the Gauss algorithm with complete pivoting, which consists in determining
  indices $i_0$ and $j_0$ ($k \le i_0, j_0 \le n$), such that

    $|a_{i_0,j_0}| = \max_{k \le i,j \le n} |a_{i,j}|,$    (6.6)

  then swapping rows k and $i_0$, and columns k and $j_0$.


Let $A_k$ be the matrix obtained at the end of step k of the Gauss elimination.
In the first k − 1 columns of $A_k$, all the entries below the diagonal are zero.


1. Write a function x=Gauss(A,b) solving the linear system Ax = b by the
   Gauss method outlined in Section 6.1. Recall that if the pivot (the diagonal
   entry of row k at step k) is zero, this method permutes row k with the next
   row i (i > k) whose entry in column k is nonzero.

2. Write a function x=GaussWithoutPivot(A,b) solving the system Ax = b
   by the Gauss method without any pivoting strategy. If the algorithm cannot
   proceed (because of a too-small pivot), return an error message.

3. Write a function x=GaussPartialPivot(A,b) solving the linear system 
Ax = b by the Gauss method with partial pivoting by row. 

4. Write a function x=GaussCompletePivot(A,b) solving the linear system 
Ax = b by the Gauss method with complete pivoting. 

5. Comparison of the algorithms. 

   (a) Check on the following example that it is sometimes necessary to use
       a pivoting strategy. Define the matrix A, and the vectors b and x by

    $A = \begin{pmatrix} \varepsilon & 1 & 1 \\ 1 & 1 & -1 \\ 1 & 1 & 2 \end{pmatrix}, \quad x = \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix}, \quad \text{and } b = Ax.$

       For $\varepsilon = 10^{-15}$, compare the solutions obtained by Gauss(A,b) and
       GaussPartialPivot(A,b). Comment.
   (b) In order to compare the Gauss pivoting algorithms, we define the
       following ratio $\varrho$, which we shall call growth rate:

    $\varrho = \frac{\max_{i,j} |A^{(n-1)}_{i,j}|}{\max_{i,j} |A_{i,j}|},$

       where $A^{(n-1)}$ denotes the upper triangular matrix generated by Gaussian
       elimination. The growth rate measures the amplification of the matrix
       entries during Gauss elimination. For numerical stability reasons,
       the ratio $\varrho$ should not be too large.
       i. Modify the programs GaussWithoutPivot, GaussPartialPivot,
          GaussCompletePivot, and Gauss to compute respectively the
          rates $\varrho_{GWP}$, $\varrho_{GPP}$, $\varrho_{GCP}$, and $\varrho_{G}$.

       ii. For different values of n, compute the growth rates for the matrices
           A, B, and C defined by
           A=DiagDomMat(n); B=SpdMat(n); C=rand(n,n);

           Conclude.
       iii. Comparison of $\varrho_{GPP}$ and $\varrho_{GCP}$.

            A. For each n = 10k (1 ≤ k ≤ 10), compute $\varrho_{GPP}$ and $\varrho_{GCP}$ for
               three (or more) randomly generated matrices A=rand(n,n).
               Plot these values on the same graph in terms of the matrix
               dimension n. What do you notice?

            B. For each n = 2k (1 ≤ k ≤ 5), compute $\varrho_{GPP}$ and $\varrho_{GCP}$ for the
               matrix defined by
               A=-tril(ones(n,n))+2*diag(ones(n,1));
               A=A+[zeros(n,n-1) [ones(n-1,1);0]];

               What do you notice?


6.4. The goal of this exercise is to evaluate the influence of row permutation 
in Gaussian elimination. Let A and b be defined by 
e=1.E-15;A=[e 1 1;1 -1 1; 1 0 1];b=[2 0 1]';


1. Compute the matrices L and U given by the function LUFacto of Exercise 
6.2. 

2. We define two matrices l and u by [l u]=LUFacto(p*A), where p is the
   permutation matrix defined by the instruction [w z p]=lu(A). Display
   the matrices l and u. What do you observe?

3. Determine the solution of the system Ax = b computed by the instruction
   BackSub(U,ForwSub(L,b)), then the solution computed by the
   instruction BackSub(u,ForwSub(l,p*b)). Compare with the exact solution
   $x = (0, 1, 1)^t$. Conclude.


6.5 (*). 


1. Write a program StoreB to store a band matrix. 
2. Write a program StoreBpv to compute the product of a band matrix with 
a vector. The matrix is given in the form StoreB. 


6.6 (*). Write a program LUBand that computes the LU factorization of a
band matrix given in the form StoreB. The resulting matrices L and U have 
to be returned in the form StoreB. 


6.7. The goal of this exercise is to study the resolution of the finite difference
discretization of the 2D Laplace equation. For given smooth functions f and
g we seek a solution u(x, y) of the following partial differential equation:

    $-\Delta u(x,y) = f(x,y)$, for $(x,y) \in \Omega = ]0,1[ \times ]0,1[$,    (6.7)

together with the boundary condition

    $u(x,y) = g(x,y)$, for $(x,y) \in \partial\Omega$ = boundary of $\Omega$,    (6.8)

where $\Delta u = \partial^2u/\partial x^2 + \partial^2u/\partial y^2$ is the Laplacian of u. As for the
one-dimensional problem, we discretize the domain $\Omega$: given the space step
h = 1/(N+1) (respectively, k = 1/(M+1)) in the direction of x (respectively,
y), we define the points

    $x_i = ih$, $i = 0, \dots, N+1$,    $y_j = jk$, $j = 0, \dots, M+1$.

The goal is to compute an approximation $(u_{i,j})$ of u at the points $(x_i, y_j)$
in $\Omega$, $1 \le i \le N$ and $1 \le j \le M$.


1. Finite difference approximation of the Laplacian.
   (a) Combining the Taylor expansions of $u(x_i - h, y_j)$ and $u(x_i + h, y_j)$,
       show that

    $\frac{\partial^2 u}{\partial x^2}(x_i, y_j) = \frac{u(x_{i-1}, y_j) - 2u(x_i, y_j) + u(x_{i+1}, y_j)}{h^2} + O(h^2).$

       We say that $(u(x_{i-1}, y_j) - 2u(x_i, y_j) + u(x_{i+1}, y_j))/h^2$ is a
       second-order approximation of $\partial^2u/\partial x^2$ at point $(x_i, y_j)$.

   (b) Same question for

    $\frac{\partial^2 u}{\partial y^2}(x_i, y_j) = \frac{u(x_i, y_{j-1}) - 2u(x_i, y_j) + u(x_i, y_{j+1})}{k^2} + O(k^2).$

   (c) Justify the finite difference method for solving the Laplace equation
       (6.7):

    $\frac{-u_{i-1,j} + 2u_{i,j} - u_{i+1,j}}{h^2} + \frac{-u_{i,j-1} + 2u_{i,j} - u_{i,j+1}}{k^2} = f_{i,j},$    (6.9)

       where $u_{i,j}$ denotes the approximation of $u(x_i, y_j)$, and $f_{i,j} = f(x_i, y_j)$.
       Formula (6.9) is called the 5-point discretization of the Laplacian,
       because it couples 5 values of u at 5 neighboring points.

2. Taking into account the boundary condition (6.8), formula (6.9) has to be
   modified for the points $(x_i, y_j)$ close to the boundary $\partial\Omega$, that is, for i = 1
   and N, or j = 1 and M. For instance, for j = 1, the term $u_{i,j-1}$ appearing
   in (6.9) is known and equal to $g_{i,0}$, according to (6.8). Therefore, this term
   moves to the right-hand side of the equality:

    $\frac{-u_{i-1,1} + 2u_{i,1} - u_{i+1,1}}{h^2} + \frac{2u_{i,1} - u_{i,2}}{k^2} = f_{i,1} + \frac{g_{i,0}}{k^2}.$    (6.10)

   For i = 1 or i = N, there is yet another term of (6.10) that is known:

    $\frac{2u_{1,1} - u_{2,1}}{h^2} + \frac{2u_{1,1} - u_{1,2}}{k^2} = f_{1,1} + \frac{g_{1,0}}{k^2} + \frac{g_{0,1}}{h^2},$

    $\frac{-u_{N-1,1} + 2u_{N,1}}{h^2} + \frac{2u_{N,1} - u_{N,2}}{k^2} = f_{N,1} + \frac{g_{N,0}}{k^2} + \frac{g_{N+1,1}}{h^2}.$

   Write the corresponding equations for the other points close to the
   boundary.

3. We now solve the linear system corresponding to (6.9) and (6.10) and
   assume, for simplicity, that h = k. Let $\bar u_j$ be the vector whose entries
   are the N unknowns located on row j, $\bar u_j = (u_{1,j}, u_{2,j}, \dots, u_{N,j})^t$, and
   $\bar f_j = (f_{1,j}, f_{2,j}, \dots, f_{N,j})^t$. Determine the matrix B such that the vectors
   $\bar u_j$ for j = 1, ..., M satisfy the equations

    $\frac{-\bar u_{j-1} + B\bar u_j - \bar u_{j+1}}{h^2} = \bar f_j.$

   For j = 1 or j = M, $\bar f_j$ must be modified in order to take into account the
   boundary values $u_{i,0}$ and $u_{i,M+1}$, which are known. For simplicity again,
   we assume g = 0. Prove that the complete system reads

    $A\bar u = \bar f,$    (6.11)

   where the unknown is $\bar u = (\bar u_1, \dots, \bar u_M)^t$, the right-hand side is
   $\bar f = (\bar f_1, \dots, \bar f_M)^t$, and the matrix A is to be determined. Exhibit the
   band structure of this matrix.

(a) Write a function Laplacian2dD(n) returning the matrix A (of order 
n?, where n = N = M). Use the Matlab function spy to visualize the 
matrix A. Hint: we do not request at this stage of the problem to use 
the Matlab instruction sparse. 

(b) Write a function Laplacian2dDRHS(n,f) returning the right-hand 
side f of equation (6.11), given n and the function f defined as a 
Matlab function. 

4. Validation. Set f(x, y) = 2x(1 − x) + 2y(1 − y), so that the solution of
   (6.7) is u(x, y) = x(1 − x)y(1 − y). For N = 10, compute the approximate
   solution and compare it with the exact solution, plotting them on the
   same graph using the function plot3.

5. Convergence. We now choose f such that the solution u is not a polynomial.

   (a) How should one choose f so that the solution is
       $u(x, y) = (x - 1)(y - 1)\sin(\pi x)\sin(\pi y)$?

   (b) What is the maximal value $N_0$ for which Matlab can carry out the
       computations (before a memory size problem occurs)?

   (c) Taking into account the sparse nature of the matrix A, we define a
       function Laplacian2dDSparse. The command sparse should be used
       to define and store the matrix A in sparse form: larger problems (i.e.,
       with N larger than $N_0$) can be solved accordingly. Let Ne be the total
       number of nonzero entries of A. Define three vectors of size Ne:

       • a vector ii of integers containing the indices of the rows of the
         nonzero entries of A;

       • a vector jj of integers containing the indices of the columns of the
         nonzero entries of A;

       • a vector u containing the nonzero entries of A.

       For any k = 1, ..., Ne, they satisfy $u(k) = A_{ii(k),jj(k)}$. Next, define a
       matrix spA=sparse(ii,jj,u). For every value N = 5, 10, 15, ..., 50,
       compute the error between the numerical solution and the exact
       solution. Plot the error in terms of N on a log-log scale. The error is
       computed in the ∞-norm, i.e., is equal to the maximum of the error
       between the exact and approximate solutions at the N × N mesh
       points. Comment on the results.

6. Spectrum of A. We fix N = 20. 

(a) Compute (using eig) the eigenvalues and the eigenvectors of A. 

(b) Use the instruction sort to find the four smallest eigenvalues. Plot 
the corresponding eigenvectors (using surfc). Hint: The eigenvectors 
computed by Matlab are vectors of size N x N, which have to be 
represented as a function of (x,y) given on an N x N regular grid. 

   (c) An eigenvalue $\lambda$ and an eigenfunction $\varphi$ of the Laplacian on the unit
       square with homogeneous Dirichlet boundary conditions are defined
       by a nonidentically zero function $\varphi$ such that

    $-\Delta\varphi = \lambda\varphi$ in $\Omega$

       and $\varphi(x, y) = 0$ for $(x, y) \in \partial\Omega$. For which values $\alpha$ and $\beta$ is
       $\varphi(x, y) = \sin(\alpha x)\sin(\beta y)$ an eigenfunction? What is the corresponding
       eigenvalue? Plot on the unit square the first four eigenfunctions of the
       Laplacian, that is, the eigenfunctions corresponding to the smallest
       eigenvalues. Interpret the curves of the previous question.


6.8. Let A be the matrix defined by A=Laplacian2dD(5), and A = LU its LU 
factorization given by LUFacto. Use the function spy to display the matrices 
L and U. Explain. 


6.9. Let A be a band matrix of order n and half bandwidth p. For n > p> 1 
compute the number of operations Nop(n, p) required for the LU factorization 
(having in mind Proposition 6.2.1). 


6.10. The goal of this exercise is to program the so-called incomplete LU
factorization of a matrix A, which is defined as the approximate factorization
A ≈ LU, where L and U are computed by the program LUFacto modified as
follows: the entries $L_{i,j}$ and $U_{i,j}$ are computed if and only if the entry $A_{i,j}$ is
not zero. If this entry is zero, we set $L_{i,j} = 0$ and $U_{i,j} = 0$.


1. Write a program ILUfacto computing the incomplete LU factorization of
   a matrix. Because of rounding errors, the condition $A_{i,j} = 0$ has to be
   replaced by $|A_{i,j}| < \varepsilon$, where $\varepsilon > 0$ is a prescribed small threshold.

2. For A=Laplacian2dD(10), compute $\mathrm{cond}_2(A)$ and $\mathrm{cond}_2(U^{-1}L^{-1}A)$.
   Explain.
Least Squares Problems


7.1 Motivation 


The origin of the least squares data-fitting problem is the need for a notion
of “generalized solutions” for a linear system Ax = b that has no solution in
the classical sense (that is, b does not belong to the range of A). The idea
is then to look for a vector x such that Ax is “the closest possible” to b.
Several norms are at hand to measure the distance between Ax and b, but the
simplest choice (which corresponds to the denomination “least squares”) is the
Euclidean vector norm. In other words, a least squares problem amounts to
finding the solution (possibly nonunique) $x \in \mathbb{R}^p$ to the following minimization
problem:

    $\|b - Ax\|_n = \min_{y \in \mathbb{R}^p} \|b - Ay\|_n,$    (7.1)

where $A \in \mathcal{M}_{n,p}(\mathbb{R})$ is a matrix with n rows and p columns, b is a vector of
$\mathbb{R}^n$, and $\|\cdot\|_n$ denotes the Euclidean norm in $\mathbb{R}^n$.

In the square case p = n, if the matrix A is nonsingular, then there exists
a unique minimizer $x = A^{-1}b$, and the minimum is equal to zero. In such a
case, a least squares problem is equivalent to solving a linear system. If A is
singular or if p ≠ n, the notion of least squares yields a generalization of
linear system solving to nonsquare or singular matrices. If a solution of the
linear system Ax = b exists, then it is also a solution of the least squares
problem. The converse is not true, as we shall see in the following geometrical
argument.

The least squares problem (7.1) has a geometrical interpretation as finding
the orthogonal projection of b on the range of A. Indeed, Ax is the closest
vector in Im(A) to b. A well-known property of the orthogonal projection
is that b − Ax is actually orthogonal to Im(A). We display in Figure 7.1 a
vector b and its orthogonal projection Ax onto the vector subspace Im(A). It
is therefore clear that (7.1) always admits at least one solution x (such that
Ax is the orthogonal projection of b), although the linear system Ay = b may
have no solution if b does not belong to Im(A).
Fig. 7.1. Least squares problem: projection of b onto Im (A).


Finally, let us recall that one of the main motivations of least squares 
problems is data-fitting (see Section 1.2). 


7.2 Main Results 


We consider the least squares problem (7.1): find $x \in \mathbb{R}^p$ that minimizes
$\|b - Ay\|_n$ over $\mathbb{R}^p$, where $A \in \mathcal{M}_{n,p}(\mathbb{R})$ is a matrix with n rows and p
columns, $b \in \mathbb{R}^n$, and $\|\cdot\|_n$ denotes the Euclidean norm in $\mathbb{R}^n$.


Lemma 7.2.1. A vector $x \in \mathbb{R}^p$ is a solution to the least squares problem
(7.1) if and only if it satisfies the so-called normal equation

    $A^*Ax = A^*b.$    (7.2)

(Observe that $A^*A$ is a square matrix of size p.)

Proof. Let $x \in \mathbb{R}^p$ be a solution of (7.1), i.e.,

    $\|b - Ax\|_n^2 \le \|b - Ay\|_n^2, \quad \forall y \in \mathbb{R}^p.$

For any $z \in \mathbb{R}^p$ and any $t \in \mathbb{R}$, set y = x + tz. Then

    $\|b - Ax\|_n^2 \le \|b - Ax\|_n^2 + 2t\langle Ax - b, Az \rangle + t^2\|Az\|_n^2.$

We infer

    $0 \le 2\,\mathrm{sign}(t)\langle Ax - b, Az \rangle + |t|\,\|Az\|_n^2,$

which implies that as t tends to 0 (from above and then from below),

    $\langle Ax - b, Az \rangle = 0, \quad \forall z \in \mathbb{R}^p.$

Thus we deduce that $A^*Ax - A^*b = 0$. Conversely, if x is a solution of the
normal equation (7.2), then

    $\langle Ax - b, Az \rangle = 0, \quad \forall z \in \mathbb{R}^p.$

Thus

    $\|b - Ax\|_n^2 \le \|b - Ay\|_n^2, \quad \forall y = x + tz \in \mathbb{R}^p.$

That is, x is also a solution of (7.1).



Theorem 7.2.1. For any matrix $A \in \mathcal{M}_{n,p}(\mathbb{R})$, there always exists at least
one solution of the normal equation (7.2). Furthermore, this solution is unique
if and only if Ker A = {0}.


Proof. If $A^*A$ is nonsingular, there exists, of course, a unique solution to the
normal equation. If it is singular, we now show that there still exists a solution,
which is not unique. Let us prove that $A^*b \in \mathrm{Im}\,A^*A$, or more generally, that
$\mathrm{Im}\,A^* \subset \mathrm{Im}\,A^*A$. The opposite inclusion, $\mathrm{Im}\,A^*A \subset \mathrm{Im}\,A^*$, is obvious, as
well as $\mathrm{Ker}\,A \subset \mathrm{Ker}\,A^*A$. On the other hand, the relation $\mathrm{Im}\,A^* = (\mathrm{Ker}\,A)^\perp$
implies that

    $\mathbb{R}^p = \mathrm{Ker}\,A \oplus \mathrm{Im}\,A^*.$    (7.3)

Moreover, $A^*A$ is real symmetric, so is diagonalizable in an orthonormal basis
of eigenvectors. Since the range and the kernel of a diagonalizable matrix are
in direct sum, we deduce

    $\mathbb{R}^p = \mathrm{Ker}\,A^*A \oplus \mathrm{Im}\,A^*A.$    (7.4)

If we can show that $\mathrm{Ker}\,A^*A \subset \mathrm{Ker}\,A$ (and thereby that $\mathrm{Ker}\,A = \mathrm{Ker}\,A^*A$),
then (7.3) and (7.4), together with the relation $\mathrm{Im}\,A^*A \subset \mathrm{Im}\,A^*$, imply that
$\mathrm{Im}\,A^* = \mathrm{Im}\,A^*A$, which is the desired result. Let us prove that
$\mathrm{Ker}\,A^*A \subset \mathrm{Ker}\,A$. If $x \in \mathrm{Ker}\,A^*A$, then

    $A^*Ax = 0 \Longrightarrow \langle A^*Ax, x \rangle = 0 \Longleftrightarrow \|Ax\|^2 = 0 \Longleftrightarrow Ax = 0,$

and thus $x \in \mathrm{Ker}\,A$. This proves the existence of at least one solution. Clearly
two solutions of the normal equation differ by a vector in $\mathrm{Ker}\,A^*A$, which is
precisely equal to Ker A.


A particular solution of the normal equation can be expressed in terms of
the pseudoinverse $A^\dagger$ of A (see Definition 2.7.2).


Proposition 7.2.1. The vector $x_b = A^\dagger b$ is a solution of the least squares
problem (7.1). When (7.1) has several solutions, $x_b$ is the unique solution
with minimal norm, i.e., for all $x \ne x_b$ such that $\|Ax_b - b\|_2 = \|Ax - b\|_2$, we
have

    $\|x_b\|_2 < \|x\|_2.$


Proof. For any $x \in \mathbb{R}^p$, we decompose Ax − b as follows:

    $Ax - b = A(x - x_b) - (I - AA^\dagger)b.$

This decomposition is orthogonal since $A(x - x_b) \in \mathrm{Im}\,A$ and
$(I - AA^\dagger)b \in (\mathrm{Im}\,A)^\perp$, because $AA^\dagger$ is the orthogonal projection matrix of
$\mathbb{C}^n$ onto Im A; see Exercise 2.29. We deduce from this decomposition that

    $\|Ax - b\|_2^2 = \|Ax - Ax_b\|_2^2 + \|Ax_b - b\|_2^2 \ge \|Ax_b - b\|_2^2,$    (7.5)

which proves that $x_b$ is a solution of (7.1). In addition, if
$\|Ax_b - b\|_2 = \|Ax - b\|_2$, then (7.5) shows that $Ax = Ax_b$ and
$z = x - x_b \in \mathrm{Ker}\,A$. We thus obtain a decomposition of x into $x = z + x_b$.
This decomposition is orthogonal, since $z \in \mathrm{Ker}\,A$ and
$x_b = A^\dagger b \in (\mathrm{Ker}\,A)^\perp$ (by definition of $A^\dagger$ in Exercise 2.29). Hence, if
$x \ne x_b$, we have

    $\|x\|_2^2 = \|z\|_2^2 + \|x_b\|_2^2 > \|x_b\|_2^2.$


Remark 7.2.1. The vector $x_b = A^\dagger b$ has a simple geometric characterization:
it is the unique vector of $(\mathrm{Ker}\,A)^\perp$ whose image under the matrix A is equal
to the projection of b onto Im A; for more details see Exercise 2.29.
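
A one-line example (chosen here for illustration, it is not taken from the text) shows how this characterization singles out the minimal-norm solution. Take n = 1, p = 2, the matrix A = (1 1), and b = 2:

```latex
\begin{aligned}
&\text{Least squares solutions: } x_1 + x_2 = 2, \quad\text{i.e.}\quad x = (1+s,\; 1-s)^t,\ s \in \mathbb{R},\\
&\mathrm{Ker}\,A = \mathrm{span}\{(1,-1)^t\}, \qquad (\mathrm{Ker}\,A)^\perp = \mathrm{span}\{(1,1)^t\},\\
&x_b = (1,1)^t \in (\mathrm{Ker}\,A)^\perp, \qquad A x_b = 2 = b,\\
&\|x\|_2^2 = (1+s)^2 + (1-s)^2 = 2 + 2s^2 > 2 = \|x_b\|_2^2 \quad\text{for } s \neq 0.
\end{aligned}
```

Every solution is the sum of $x_b$ and a vector $s(1,-1)^t$ of Ker A, and the orthogonality of the two parts is exactly what makes $x_b$ the shortest one.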


7.3 Numerical Algorithms 


Before introducing efficient numerical methods for solving problem (7.2), we 
first study the sensitivity of the solution to variations of the data. 


7.3.1 Conditioning of Least Squares Problems 


In this section we assume that system (7.2) has a unique solution, that is,
Ker A is reduced to the zero vector. Note that this is possible only if p ≤ n. In
this case, the square matrix $A^*A$ is nonsingular, and by Theorem 7.2.1, the
least squares problem has a unique solution, equal to $A^\dagger b$.


Sensitivity of the Solution to Variations of b. 


For a given $b_0 \in \mathbb{R}^n$, we call $x_0 = A^\dagger b_0$ the solution to the least squares
problem

    $\min_{y \in \mathbb{R}^p} \|Ay - b_0\|_n.$    (7.6)

Similarly for a given $b_1 \in \mathbb{R}^n$, we call $x_1 = A^\dagger b_1$ the solution to the least
squares problem

    $\min_{y \in \mathbb{R}^p} \|Ay - b_1\|_n.$    (7.7)


Before analyzing the variations of x in terms of the variations of b, let us first 
observe that only the projection of the vector b onto Im A counts; it is the 
point of the next remark. 


Remark 7.3.1. As already explained in Section 7.1, a solution x of the least
squares problem can be obtained by taking Ax as the orthogonal projection
of b onto Im(A). Therefore, if we modify the vector b without changing its
orthogonal projection onto Im(A), we preserve the same solution of the least
squares problem. In other words, if $b_0$ and $b_1$ have the same projection z onto
Im(A), we have $Ax_0 = Ax_1 = z$. Since the kernel of A is Ker A = {0}, we
clearly obtain equality between the two solutions, $x_0 = x_1$.

Bound from above of the absolute variation


We have a direct upper bound of the variations of the solution:

    $\|x_1 - x_0\|_2 = \|A^\dagger(b_1 - b_0)\|_2 \le \frac{1}{\mu_p}\|b_1 - b_0\|_2,$

where $\mu_p$ is the smallest nonzero singular value of the matrix A; see Remark
5.3.4.


Bound from above of the relative variation 


Assuming that $x_0$ and $b_0$ are nonzero, the relative error on x can be bounded
in terms of the relative error on b.


Proposition 7.3.1. Assume that Ker A = {0}. Let $b_0$, $b_1$ be the vectors
defined in (7.6), (7.7), and $x_0$, $x_1$ their corresponding solutions. They satisfy

    $\frac{\|x_1 - x_0\|_2}{\|x_0\|_2} \le C_b \frac{\|b_1 - b_0\|_2}{\|b_0\|_2},$    (7.8)

where

    $C_b = \|A^\dagger\|_2 \frac{\|b_0\|_2}{\|x_0\|_2}.$    (7.9)

The constant $C_b$ is a measure of the amplification of the relative error on the
solution x with respect to the relative error on the right-hand side b. This
constant is the product of several quantities:

    $C_b = \frac{\mathrm{cond}(A)}{\eta \cos\theta},$

where

▷ $\mathrm{cond}(A) = \|A\|_2\|A^\dagger\|_2$ is the generalized conditioning of A,

▷ $\theta$ denotes the angle formed by the vectors $Ax_0$ and $b_0$ (see Figure 7.1),
  i.e., $\cos\theta = \|Ax_0\|_2/\|b_0\|_2$,

▷ $\eta = \|A\|_2\|x_0\|_2/\|Ax_0\|_2$ indicates the gap between the norm of the vector
  $Ax_0$ and the maximal value that can be taken by this norm (it always
  satisfies $\eta \ge 1$).

We single out the following particular cases:

▷ if $b_0 \in \mathrm{Im}\,A$, then $\theta = 0$ and since $\eta \ge 1$, the amplification constant $C_b$ is
  at most equal to cond(A);

▷ if $b_0 \in (\mathrm{Im}\,A)^\perp$, then $\theta = \pi/2$ and $z = 0 = Ax_0$. The amplification
  constant $C_b$ is infinite in this case.
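
The product form of the constant follows by substituting the definitions of cond(A), η, and cos θ; written out as a short check:

```latex
\frac{\mathrm{cond}(A)}{\eta\cos\theta}
  = \|A\|_2\,\|A^\dagger\|_2
    \cdot \frac{\|Ax_0\|_2}{\|A\|_2\,\|x_0\|_2}
    \cdot \frac{\|b_0\|_2}{\|Ax_0\|_2}
  = \|A^\dagger\|_2\,\frac{\|b_0\|_2}{\|x_0\|_2}
  = C_b .
```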
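Sensitivity of the Solution to Variations of A.


We now vary the matrix A. Let $A_0$ be a reference matrix with $\mathrm{Ker}\,A_0 = \{0\}$,
and let

    $A_\varepsilon = A_0 + \varepsilon B, \quad B \in \mathcal{M}_{n,p}(\mathbb{C}),$    (7.10)

be the matrix of the perturbed problem. For $\varepsilon$ small enough, $A_\varepsilon^*A_\varepsilon$ is
nonsingular, and the respective solutions to the reference and perturbed problems
are denoted by $x_0$ and $x_\varepsilon$. They satisfy

    $A_0^*A_0x_0 = A_0^*b$ and $A_\varepsilon^*A_\varepsilon x_\varepsilon = A_\varepsilon^*b.$

We perform a Taylor expansion for $\varepsilon$ close to 0:

    $x_\varepsilon = (A_\varepsilon^*A_\varepsilon)^{-1}A_\varepsilon^*b = \big[(A_0^* + \varepsilon B^*)(A_0 + \varepsilon B)\big]^{-1}(A_0^* + \varepsilon B^*)b$
    $\phantom{x_\varepsilon} = \big[A_0^*A_0 + \varepsilon(A_0^*B + B^*A_0) + O(\varepsilon^2)\big]^{-1}(A_0^*b + \varepsilon B^*b)$
    $\phantom{x_\varepsilon} = \big[I + \varepsilon(A_0^*A_0)^{-1}(A_0^*B + B^*A_0) + O(\varepsilon^2)\big]^{-1}(A_0^*A_0)^{-1}(A_0^*b + \varepsilon B^*b)$
    $\phantom{x_\varepsilon} = \big[I - \varepsilon(A_0^*A_0)^{-1}(A_0^*B + B^*A_0) + O(\varepsilon^2)\big]\big[x_0 + \varepsilon(A_0^*A_0)^{-1}B^*b\big].$

Therefore, we deduce that

    $x_\varepsilon - x_0 = (A_0^*A_0)^{-1}(\varepsilon B^*)(b - A_0x_0) - (A_0^*A_0)^{-1}A_0^*(\varepsilon B)x_0 + O(\varepsilon^2).$

Setting $\Delta A_0 = A_\varepsilon - A_0$, we get the following upper bound:

    $\frac{\|x_\varepsilon - x_0\|_2}{\|x_0\|_2} \le \|(A_0^*A_0)^{-1}\|_2\,\|\Delta A_0\|_2\,\frac{\|b - A_0x_0\|_2}{\|x_0\|_2} + \|A_0^\dagger\|_2\,\|\Delta A_0\|_2 + O(\varepsilon^2).$

On the other hand, we have $\tan\theta = \|b - z\|_2/\|z\|_2$, where $z = A_0x_0$ is the
projection of b onto $\mathrm{Im}\,A_0$, and

    $\|(A_0^*A_0)^{-1}\|_2 = \frac{1}{\sigma} = \frac{1}{\min_{\lambda \in \sigma(A_0^*A_0)} |\lambda|} = \frac{1}{\min_i \mu_i^2} = \|A_0^\dagger\|_2^2,$

where $\sigma$ is the smallest singular value of $A_0^*A_0$, $\mu_i$ are the singular values of
$A_0$, and we have used the fact that $A_0^*A_0$ is normal. Hence, we have proved the
following result:


Proposition 7.3.2. Assume that $\mathrm{Ker}\,A_0 = \{0\}$. Let $x_0$ and $x_\varepsilon$ be the
solutions to the least squares problems associated with the matrices $A_0$ and $A_\varepsilon$,
respectively, with the same right-hand side b. The following upper bound holds
as $\varepsilon$ tends to 0:

    $\frac{\|x_\varepsilon - x_0\|_2}{\|x_0\|_2} \le C_A \frac{\|A_\varepsilon - A_0\|_2}{\|A_0\|_2} + O(\varepsilon^2),$    (7.11)

where $C_A = \mathrm{cond}_2(A_0) + \frac{\tan\theta}{\eta}\,\mathrm{cond}_2(A_0)^2$.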
Therefore, the relative error on x can be amplified by the factor $C_A$. We single
out the following particular cases:


▷ if $b_0 \in \mathrm{Im}\,A$, then $C_A = \mathrm{cond}_2(A_0)$;

▷ if $b_0 \in (\mathrm{Im}\,A)^\perp$, the amplification factor $C_A$ is infinite;

▷ in all other cases, $C_A$ is usually of the same order as $\mathrm{cond}_2(A_0)^2$. Of
  course, if $(\tan\theta)/\eta$ is very small (much smaller than $\mathrm{cond}_2(A_0)^{-1}$), $C_A$
  is of the order of $\mathrm{cond}_2(A_0)$, while if $(\tan\theta)/\eta$ is very large, $C_A$ is larger
  than $\mathrm{cond}_2(A_0)^2$.


7.3.2 Normal Equation Method 


Lemma 7.2.1 tells us that the solution to the least squares problem is also a 
solution of the normal equation defined by 


A* Ax = A*b. 


Since A*A is a square matrix of size p, we can apply to this linear system 
the methods for solving linear systems, as seen in Chapter 6. If Ker A = {0}, 
then the matrix A* A is even symmetric and positive definite, so we can apply 
the most efficient algorithm, that is, the Cholesky method. 


Operation Count 


As usual, we count only multiplications and we give an equivalent for n and
p large. When applying the Cholesky algorithm to the normal equation, the
following operations are performed:


• Multiplication of $A^*$ by A: it is the product of a p × n matrix by an n × p
  matrix. The matrix $A^*A$ is symmetric, so only the upper part has to be
  computed. The number of operations $N_{op}$ is exactly

    $N_{op} = \frac{np(p+1)}{2}.$

• Cholesky factorization:

    $N_{op} \approx \frac{p^3}{6}.$

• Computing the right-hand side: the matrix-vector product $A^*b$ costs
  $N_{op} = pn$.
• Substitutions: solving two triangular linear systems costs

    $N_{op} \approx p^2.$
In general, $n$ is much larger than $p$, which makes the cost of the Cholesky factorization marginal with respect to the cost of the matrix product $A^*A$. However, this method is not recommended if $p$ is large and the condition number of $A$ is also large. Indeed, the amplification of rounding errors while solving the normal equation is governed by the conditioning of $A^*A$, which is in general of the order of the square of the conditioning of $A$. We shall see other methods where the conditioning is simply equal to that of $A$.
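The steps counted above can be sketched in a few lines of plain Python (used here only for illustration; the function name `normal_equations_solve` and the small test problem are assumptions, not part of the text). The sketch forms $A^tA$ and $A^tb$, factorizes by Cholesky, and performs the two triangular solves, assuming real entries and Ker $A = \{0\}$.

```python
import math

def normal_equations_solve(A, b):
    """Least squares solution of min ||Ax - b||_2 via A^t A x = A^t b and Cholesky.
    A is a list of n rows of length p (real entries, full column rank assumed)."""
    n, p = len(A), len(A[0])
    # Upper part of the symmetric matrix G = A^t A, mirrored to the lower part.
    G = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            G[i][j] = G[j][i] = sum(A[k][i] * A[k][j] for k in range(n))
    # Right-hand side c = A^t b.
    c = [sum(A[k][i] * b[k] for k in range(n)) for i in range(p)]
    # Cholesky factorization G = L L^t with L lower triangular.
    L = [[0.0] * p for _ in range(p)]
    for j in range(p):
        d = G[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)  # raises ValueError if G is not positive definite
        for i in range(j + 1, p):
            L[i][j] = (G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    # Forward substitution L y = c, then back substitution L^t x = y.
    y = [0.0] * p
    for i in range(p):
        y[i] = (c[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * p
    for i in reversed(range(p)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, p))) / L[i][i]
    return x

# Fit a line c0 + c1*t through four points that lie exactly on 1 + 2t.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 7.0]
x = normal_equations_solve(A, b)
```

On this well-conditioned example the coefficients (1, 2) are recovered; the warning above about the conditioning of $A^*A$ concerns matrices for which forming the product already loses accuracy.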


7.3.3 QR Factorization Method 


The main idea of the QR factorization method is to reduce the problem to 
a least squares problem with a triangular matrix. We thus factorize A as the 
product of a triangular matrix R and an orthogonal (unitary) matrix Q. We 
recall that the multiplication by an orthogonal matrix preserves the Euclidean 
norm of a vector: 


$$\|Qz\|_n = \|z\|_n \quad \forall z \in \mathbb{R}^n \quad \text{if } Q^{-1} = Q^*.$$


Let $A \in M_{n,p}(\mathbb{R})$ be a (not necessarily square) matrix. We determine $R \in M_{n,p}(\mathbb{R})$ such that $r_{i,j} = 0$ if $i > j$, and $Q \in M_{n,n}(\mathbb{R})$ such that $Q^{-1} = Q^*$, satisfying

$$A = QR.$$


The original least squares problem (7.1) is then equivalent to the following 
triangular problem: 


$$\|Q^*b - Rx\|_n = \min_{y \in \mathbb{R}^p} \|Q^*b - Ry\|_n,$$


which is easily solved by a simple back substitution. We first study this method 
based on the Gram-Schmidt procedure (in the next section, we shall see an- 
other more powerful algorithm). We distinguish three cases. 


Case n = p 


If the matrix A is nonsingular, then we know that the solution to the least 
squares problem is unique and equal to the solution of the linear system Ax = 
b. We have seen in Chapter 6 how the QR method is applied to such a system. 
If the matrix A is singular, we need to slightly modify the previous QR method. 
Let $a_1, \dots, a_n$ be the column vectors of $A$. Since these vectors are linearly dependent, there exists $i$ such that $a_1, \dots, a_i$ are linearly independent, and $a_{i+1}$ is generated by $a_1, \dots, a_i$. The Gram-Schmidt procedure (see Theorem 2.1.1) would stop at the $(i+1)$th step, because

$$\tilde{a}_{i+1} = a_{i+1} - \sum_{k=1}^{i} \langle q_k, a_{i+1}\rangle q_k = 0.$$
k=1 
Indeed, since the subspaces span $\{q_1, \dots, q_i\}$ and span $\{a_1, \dots, a_i\}$ are equal and $a_{i+1}$ belongs to the latter one, the orthogonal projection of $a_{i+1}$ onto span $\{q_1, \dots, q_i\}$ is equal to $a_{i+1}$, and $\tilde{a}_{i+1}$ vanishes. To avoid this difficulty, we first swap the columns of $A$ to bring the linearly independent columns of $A$ into the first positions. In other words, we multiply $A$ by a permutation matrix $P$ such that the first $\mathrm{rk}(A)$ columns of $AP$ are linearly independent and the last $n - \mathrm{rk}(A)$ columns of $AP$ are spanned by the first $\mathrm{rk}(A)$ ones. This permutation can be carried out simultaneously with the Gram-Schmidt procedure: if a norm is zero at step $i+1$, we perform a circular permutation from the $(i+1)$th column to the $n$th. Permutation matrices are orthogonal, so the change of variable $z = P^*y$ yields

$$\|b - Ay\|_n = \|b - APP^*y\|_n = \|b - (AP)z\|_n.$$


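We apply the Gram-Schmidt procedure to the matrix $AP$ up to step $\mathrm{rk}(A)$ (we cannot go further). Hence, we obtain orthonormal vectors $q_1, \dots, q_{\mathrm{rk}(A)}$, to which we can add vectors $q_{\mathrm{rk}(A)+1}, \dots, q_n$ in order to obtain an orthonormal basis of $\mathbb{R}^n$. We call $Q$ the matrix formed by these column vectors. We have

$$q_i = \frac{a_i - \sum_{k=1}^{i-1} \langle q_k, a_i\rangle q_k}{\left\|a_i - \sum_{k=1}^{i-1} \langle q_k, a_i\rangle q_k\right\|}, \quad 1 \le i \le \mathrm{rk}(A),$$

and $a_i \in \mathrm{span}\{a_1, \dots, a_{\mathrm{rk}(A)}\} = \mathrm{span}\{q_1, \dots, q_{\mathrm{rk}(A)}\}$ if $\mathrm{rk}(A)+1 \le i \le n$. Therefore, we infer that there exist scalars $r_{k,i}$ such that

$$\begin{cases} a_i = \sum_{k=1}^{i} r_{k,i}\,q_k, & \text{with } r_{i,i} > 0 \text{ if } 1 \le i \le \mathrm{rk}(A), \\ a_i = \sum_{k=1}^{\mathrm{rk}(A)} r_{k,i}\,q_k & \text{if } \mathrm{rk}(A)+1 \le i \le n. \end{cases} \qquad (7.12)$$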
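We set $r_{k,i} = 0$ if $k > i$, and call $R$ the upper triangular matrix with entries $(r_{k,i})$:

$$R = \begin{pmatrix} R_{1,1} & R_{1,2} \\ 0 & 0 \end{pmatrix} \quad \text{with} \quad R_{1,1} = \begin{pmatrix} r_{1,1} & \cdots & r_{1,\mathrm{rk}(A)} \\ & \ddots & \vdots \\ 0 & & r_{\mathrm{rk}(A),\mathrm{rk}(A)} \end{pmatrix}.$$

Relations (7.12) are simply written $AP = QR$. Let $z = (z_1, z_2)$ with $z_1$ the vector of the first $\mathrm{rk}(A)$ entries, and $z_2$ that of the last $n - \mathrm{rk}(A)$. We have

$$\|b - APz\|_n^2 = \|Q^*b - Rz\|_n^2 = \|(Q^*b)_1 - R_{1,1}z_1 - R_{1,2}z_2\|_{\mathrm{rk}(A)}^2 + \|(Q^*b)_2\|_{n-\mathrm{rk}(A)}^2.$$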


Since $R_{1,1}$ is upper triangular and nonsingular, by a simple back substitution, and whatever the vector $z_2$ is, we can compute a solution:

$$z_1 = R_{1,1}^{-1}\left((Q^*b)_1 - R_{1,2}z_2\right). \qquad (7.13)$$

Consequently, the value of the minimum is

$$\|(Q^*b)_2\|_{n-\mathrm{rk}(A)} = \min_{y \in \mathbb{R}^n} \|b - Ay\|_n.$$


Since z2 is not prescribed, there is an infinite number of solutions (a vector 
space of dimension n — rk (A)) to the least squares problem. 


Case n < p 


In this case, we always have Ker $A \neq \{0\}$. Therefore, there is an infinity of solutions. For simplicity, we assume that the rank of $A$ is maximal, i.e., equal to $n$. Otherwise, we have to slightly modify the argument that follows. Let $a_1, \dots, a_p \in \mathbb{R}^n$ be the columns of $A$. Since $\mathrm{rk}(A) = n$, possibly after permuting the columns, the first $n$ columns of $A$ are linearly independent in $\mathbb{R}^n$ and we can apply the Gram-Schmidt procedure to them. We thus obtain an orthogonal matrix $Q$ of size $n$, with columns $q_1, \dots, q_n$ satisfying

$$q_i = \frac{a_i - \sum_{k=1}^{i-1} \langle q_k, a_i\rangle q_k}{\left\|a_i - \sum_{k=1}^{i-1} \langle q_k, a_i\rangle q_k\right\|}, \quad 1 \le i \le n.$$


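On the other hand, $a_{n+1}, \dots, a_p$ are spanned by $q_1, \dots, q_n$, which is a basis of $\mathbb{R}^n$. That is, there exist entries $r_{k,i}$ such that

$$\begin{cases} a_i = \sum_{k=1}^{i} r_{k,i}\,q_k, & \text{with } r_{i,i} > 0 \text{ if } 1 \le i \le n, \\ a_i = \sum_{k=1}^{n} r_{k,i}\,q_k & \text{if } n+1 \le i \le p. \end{cases}$$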


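Set $r_{k,i} = 0$ if $k > i$, and call $R$ the $n \times p$ upper triangular matrix with entries $(r_{k,i})$:

$$R = \begin{pmatrix} R_{1,1} & R_{1,2} \end{pmatrix}, \quad \text{with} \quad R_{1,1} = \begin{pmatrix} r_{1,1} & \cdots & r_{1,n} \\ & \ddots & \vdots \\ 0 & & r_{n,n} \end{pmatrix},$$

and $R_{1,2}$ is an $n \times (p - n)$ matrix. Set $z = (z_1, z_2)$ with $z_1$ the vector formed by the first $n$ entries, and $z_2$ by the last $p - n$. We have

$$\|b - Az\|_n = \|Q^*b - R_{1,1}z_1 - R_{1,2}z_2\|_n.$$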


Since $R_{1,1}$ is upper triangular and nonsingular, by a simple back substitution, and for any choice of $z_2$, we can compute a solution:

$$z_1 = R_{1,1}^{-1}\left(Q^*b - R_{1,2}z_2\right). \qquad (7.14)$$

As a consequence, the minimum value is

$$0 = \min_{y \in \mathbb{R}^p} \|b - Ay\|_n.$$


Since z2 is not prescribed, there is an infinite number of solutions (a vector 
space of dimension p — n) to the least squares problem. 


Case n > p 


This is the most widespread case in practice, that is, there are more equations than unknowns. For simplicity, we assume that Ker $A = \{0\}$ (which is equivalent to $\mathrm{rk}(A) = p$), so the least squares fitting problem has a unique solution. If Ker $A \neq \{0\}$, then what follows should be modified as in the case $n = p$.

We apply the Gram-Schmidt procedure to the (linearly independent) columns $a_1, \dots, a_p$ of $A$ in order to obtain orthonormal vectors $q_1, \dots, q_p$. We complement this set of vectors by $q_{p+1}, \dots, q_n$ to get an orthonormal basis of $\mathbb{R}^n$. We call $Q$ the matrix formed by these column vectors. We have

$$a_i = \sum_{k=1}^{i} r_{k,i}\,q_k, \quad \text{with } r_{i,i} > 0 \text{ if } 1 \le i \le p.$$
k=1 


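Set $r_{k,i} = 0$ if $k > i$, and call $R$ the $n \times p$ upper triangular matrix with entries $(r_{k,i})$:

$$R = \begin{pmatrix} R_{1,1} \\ 0 \end{pmatrix} \quad \text{with} \quad R_{1,1} = \begin{pmatrix} r_{1,1} & \cdots & r_{1,p} \\ & \ddots & \vdots \\ 0 & & r_{p,p} \end{pmatrix}.$$

Denoting by $(Q^*b)_p$ (respectively, $(Q^*b)_{n-p}$) the vector of the first $p$ (respectively, the last $n - p$) entries of $Q^*b$, we write

$$\|b - Az\|_n^2 = \|Q^*b - Rz\|_n^2 = \|(Q^*b)_p - R_{1,1}z\|_p^2 + \|(Q^*b)_{n-p}\|_{n-p}^2.$$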


Since $R_{1,1}$ is upper triangular and nonsingular, by a simple back substitution we can compute the solution

$$z = R_{1,1}^{-1}(Q^*b)_p. \qquad (7.15)$$

Consequently, the minimum value is

$$\|(Q^*b)_{n-p}\|_{n-p} = \min_{y \in \mathbb{R}^p} \|b - Ay\|_n.$$


Note that in this case, there is a unique solution to the least squares problem 
given by formula (7.15). 
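The case $n > p$ can be illustrated by a short Python sketch (the names `gram_schmidt_qr` and `qr_least_squares` and the test data are assumptions chosen for this example). It builds only the first $p$ columns of $Q$ and the block $R_{1,1}$, since by (7.15) nothing else is needed, and it assumes the columns of $A$ are linearly independent.

```python
import math

def gram_schmidt_qr(A):
    """Gram-Schmidt on the columns of an n x p matrix with independent columns.
    Returns Q as a list of p orthonormal columns and R as a p x p upper
    triangular matrix (the block R_{1,1} of the text)."""
    n, p = len(A), len(A[0])
    cols = [[A[k][i] for k in range(n)] for i in range(p)]
    Q, R = [], [[0.0] * p for _ in range(p)]
    for i in range(p):
        v = cols[i][:]
        for k in range(i):
            R[k][i] = sum(Q[k][m] * cols[i][m] for m in range(n))  # <q_k, a_i>
            v = [v[m] - R[k][i] * Q[k][m] for m in range(n)]
        R[i][i] = math.sqrt(sum(t * t for t in v))  # positive for independent columns
        Q.append([t / R[i][i] for t in v])
    return Q, R

def qr_least_squares(A, b):
    """Solve min ||Ax - b||_2 by back substitution on R_{1,1} x = (Q^t b)_p."""
    Q, R = gram_schmidt_qr(A)
    p = len(R)
    c = [sum(q[m] * b[m] for m in range(len(b))) for q in Q]  # first p entries of Q^t b
    x = [0.0] * p
    for i in reversed(range(p)):
        x[i] = (c[i] - sum(R[i][j] * x[j] for j in range(i + 1, p))) / R[i][i]
    return x

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 0.0, 2.0]
x = qr_least_squares(A, b)   # the normal equations give x = (0.5, 0.5) for this data
```

The remaining vectors $q_{p+1}, \dots, q_n$ are never formed here; they would only be needed to evaluate the minimum value through $(Q^*b)_{n-p}$, which can equally be obtained as $\|b - Ax\|_n$.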
Operation Count


We compute the number of multiplications required by the Gram-Schmidt 
algorithm when n > p. 


• Orthonormalization: at each step $1 \le i \le p$, we compute $i - 1$ scalar products of vectors of $\mathbb{R}^n$, and $i - 1$ vector-scalar products. The number of operations $N_{op}$ is therefore

$$N_{op} \approx \sum_{i=1}^{p} 2(i-1)n \approx np^2.$$

• Updating the right-hand side: the cost of the matrix-vector product $Q^*b$ can be reduced by remarking that in (7.15), only the first $p$ entries of $Q^*b$ are required; hence

$$N_{op} \approx pn.$$

• Substitution: solving a triangular linear system of size $p$ costs

$$N_{op} \approx p^2/2.$$


For large $n$, the QR method with the Gram-Schmidt algorithm is less efficient than the normal equation method if we compare the number of operations. The triangular system $Rx = Q^*b$ (in the case $n = p$) is, however, better conditioned. Indeed, $\mathrm{cond}_2(R) = \mathrm{cond}_2(A)$, since $R$ and $A$ differ by multiplication by an orthogonal matrix (for the normal equation method, it is the conditioning of $A^*A$ that matters). Nevertheless, in practice, this algorithm is not recommended for large matrices $A$ (the Householder algorithm described next is preferred). Indeed, the Gram-Schmidt algorithm is numerically unstable in the sense that for large values of $p$, the columns of $Q$ are no longer perfectly orthogonal, so $Q^{-1}$ is numerically no longer equal to $Q^*$.


7.3.4 Householder Algorithm 


The Householder algorithm is an implementation of the QR method that 
does not rely on the Gram-Schmidt algorithm. It amounts to multiplying 
the matrix A by a sequence of very simple orthogonal matrices (the so-called 
Householder matrices) so as to shape A progressively into an upper triangular 
matrix. 


Definition 7.3.1. Let $v \in \mathbb{R}^n$ be a nonzero vector. The Householder matrix associated with the vector $v$, denoted by $H(v)$, is defined by

$$H(v) = I - 2\,\frac{vv^t}{\|v\|^2}.$$

We set $H(0) = I$; the identity is thus considered as a Householder matrix.

Remark 7.3.2. The product $vv^t$ of an $n \times 1$ matrix by a $1 \times n$ matrix is indeed a square matrix of order $n$, which is equivalently denoted by $v \otimes v$. It is easy to check by associativity that the product $(vv^t)x$ is equal to $\langle v, x\rangle v$.


Householder matrices feature interesting properties that are described in 
the following result; see also Figure 7.2. 


Fig. 7.2. Householder transformation H(u) = orthogonal symmetry with respect 
to the hyperplane that is orthogonal to u. 


Lemma 7.3.1. Let $H(v)$ be a Householder matrix.

(i) $H(v)$ is symmetric and orthogonal.
(ii) Let $e$ be a unit vector. For all $v \in \mathbb{R}^n$, we have

$$H(v + \|v\|e)\,v = -\|v\|e \qquad (7.16)$$

and

$$H(v - \|v\|e)\,v = +\|v\|e. \qquad (7.17)$$


Proof. Obviously $H(v)^t = H(v)$. On the other hand, we have

$$H^2 = I - 4\,\frac{vv^t}{\|v\|^2} + 4\,\frac{(vv^t)(vv^t)}{\|v\|^4} = I,$$

since $(vv^t)(vv^t) = \|v\|^2(vv^t)$. Hence $H(v)$ is also orthogonal. Without loss of generality, we can assume that $e$ is the first vector of the canonical basis $(e_i)_{1 \le i \le n}$. Let $w = v + \|v\|e$. If $w$ is the null vector, then $H(w)v = v = -\|v\|e$, and relation (7.16) holds. For $w \neq 0$,

$$H(w)v = v - 2\,\frac{ww^t}{\|w\|^2}\,v = v - \frac{2(\|v\|^2 + \|v\|v_1)(v + \|v\|e)}{(v_1 + \|v\|)^2 + \sum_{k \neq 1} |v_k|^2} = v - \frac{2(\|v\|^2 + \|v\|v_1)(v + \|v\|e)}{2\|v\|^2 + 2v_1\|v\|} = v - (v + \|v\|e) = -\|v\|e.$$

A similar computation gives $H(v - \|v\|e)v = \|v\|e$.
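Relation (7.16) is easy to check numerically. The following Python sketch (the helper name `householder_apply` and the sample vector are assumptions made for this illustration) applies $H(w)$ without ever forming the matrix, using $H(w)x = x - 2\langle w, x\rangle w/\|w\|^2$.

```python
import math

def householder_apply(v, x):
    """Return H(v) x = x - 2 <v, x> v / ||v||^2 without forming the matrix H(v)."""
    vv = sum(t * t for t in v)
    if vv == 0.0:            # convention H(0) = I
        return x[:]
    s = 2.0 * sum(a * c for a, c in zip(v, x)) / vv
    return [xi - s * vi for xi, vi in zip(x, v)]

v = [3.0, 4.0, 12.0]                         # ||v|| = 13
norm_v = math.sqrt(sum(t * t for t in v))
e = [1.0, 0.0, 0.0]                          # first vector of the canonical basis
w = [vi + norm_v * ei for vi, ei in zip(v, e)]
image = householder_apply(w, v)              # relation (7.16) predicts -||v|| e
```

Applying $H(w)$ this way costs one scalar product and one vector update, which is the observation behind the operation count of the Householder algorithm.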


We describe the Householder algorithm in the case $n \ge p$, that is, the matrix $A \in M_{n,p}(\mathbb{R})$ has more rows than columns. The Householder algorithm defines a sequence of Householder matrices $H^k \in M_n(\mathbb{R})$ and a sequence of matrices $A^{k+1} \in M_{n,p}(\mathbb{R})$ for $1 \le k \le p$ satisfying

$$A^1 = A, \quad A^{k+1} = H^kA^k, \quad A^{p+1} = R,$$

where $R$ is upper triangular. Each matrix $A^k$ has zeros below the diagonal in its first $(k-1)$ columns, and the Householder matrix $H^k$ is built in such a way that it reduces to zero the $k$th column below the diagonal in $A^{k+1}$.
Step 1. We set $A^1 = A$. Let $a^1$ be the first column of $A^1$. If we have

$$a^1 = (a^1_{1,1}, 0, \dots, 0)^t,$$

then we are done by taking $H^1 = I$. Otherwise, we set

$$H^1 = H(a^1 + \|a^1\|e_1),$$

and define $A^2 = H^1A^1$. By virtue of Lemma 7.3.1, the first column of $A^2$ is

$$A^2e_1 = H^1a^1 = \begin{pmatrix} -\|a^1\| \\ 0 \\ \vdots \\ 0 \end{pmatrix},$$

which is the desired result at the first step. We could also have taken $H^1 = H(a^1 - \|a^1\|e_1)$; we choose the sign according to numerical stability criteria.
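Step k. Assume that the first $k - 1$ columns of $A^k$ have zeros below the diagonal:

$$A^k = \begin{pmatrix} a^k_{1,1} & \cdots & \cdots & \cdots & \cdots & a^k_{1,p} \\ 0 & \ddots & & & & \vdots \\ \vdots & \ddots & a^k_{k-1,k-1} & \times & \cdots & \vdots \\ \vdots & & 0 & a^k_{k,k} & \cdots & \vdots \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & a^k_{n,k} & \cdots & a^k_{n,p} \end{pmatrix}.$$

Let $a^k$ be the vector of size $(n+1-k)$ made of the last $(n+1-k)$ entries of the $k$th column of $A^k$. If we have

$$a^k = (a^k_{k,k}, 0, \dots, 0)^t,$$

then we choose $H^k = I$. Otherwise, we set

$$H^k = \begin{pmatrix} I_{k-1} & 0 \\ 0 & H(a^k + \|a^k\|e_1) \end{pmatrix},$$

where $I_{k-1}$ is the identity matrix of order $k - 1$, and $H(a^k + \|a^k\|e_1)$ is a Householder matrix of order $(n+1-k)$. We define $A^{k+1} = H^kA^k$. By virtue of Lemma 7.3.1, the $k$th column of $A^{k+1}$ is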


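$$A^{k+1}e_k = H^k\begin{pmatrix} a^k_{1,k} \\ \vdots \\ a^k_{k-1,k} \\ a^k \end{pmatrix} = \begin{pmatrix} a^k_{1,k} \\ \vdots \\ a^k_{k-1,k} \\ H(a^k + \|a^k\|e_1)a^k \end{pmatrix} = \begin{pmatrix} a^k_{1,k} \\ \vdots \\ a^k_{k-1,k} \\ -\|a^k\| \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$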


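Furthermore, in view of the structure of $H^k$, the first $(k-1)$ columns of $A^{k+1}$ are exactly the same as those of $A^k$. Consequently,

$$A^{k+1} = \begin{pmatrix} a^{k+1}_{1,1} & \cdots & \cdots & \cdots & \cdots & a^{k+1}_{1,p} \\ 0 & \ddots & & & & \vdots \\ \vdots & \ddots & a^{k+1}_{k,k} & \times & \cdots & \vdots \\ \vdots & & 0 & a^{k+1}_{k+1,k+1} & \cdots & \vdots \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & a^{k+1}_{n,k+1} & \cdots & a^{k+1}_{n,p} \end{pmatrix},$$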


which is the desired result at the $k$th step. We could also have taken $H(a^k - \|a^k\|e_1)$ in $H^k$; we choose the sign according to numerical stability criteria. After $p$ steps we have thus obtained an upper triangular matrix $A^{p+1}$ such that

$$R = A^{p+1} = \begin{pmatrix} a^{p+1}_{1,1} & \cdots & a^{p+1}_{1,p} \\ 0 & \ddots & \vdots \\ \vdots & \ddots & a^{p+1}_{p,p} \\ \vdots & & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{pmatrix}.$$

Setting $Q = H^1 \cdots H^p$, which is an orthogonal matrix, we have indeed obtained that $A = QR$.
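The steps above can be sketched in Python as follows (a minimal illustration, not a tuned implementation; the function name `householder_qr` and the sign rule are assumptions). At step $k$ the sketch uses $v = a^k + \mathrm{sign}(a^k_1)\|a^k\|e_1$, one common way to make the sign choice mentioned in the text so that no cancellation occurs in the first entry of $v$.

```python
import math

def householder_qr(A):
    """Return (Q, R) with A = Q R, Q orthogonal n x n and R upper triangular n x p,
    for a real n x p matrix with n >= p."""
    n, p = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(p):
        a = [R[i][k] for i in range(k, n)]
        if all(t == 0.0 for t in a[1:]):
            continue                          # column already reduced: H^k = I
        norm_a = math.sqrt(sum(t * t for t in a))
        sign = 1.0 if a[0] >= 0.0 else -1.0   # avoids cancellation in v[0]
        v = a[:]
        v[0] += sign * norm_a
        vv = sum(t * t for t in v)
        # R <- H^k R: only rows k..n-1 and columns k..p-1 change.
        for j in range(k, p):
            s = 2.0 * sum(v[i - k] * R[i][j] for i in range(k, n)) / vv
            for i in range(k, n):
                R[i][j] -= s * v[i - k]
        # Q <- Q H^k, so that Q = H^1 ... H^p at the end.
        for r in range(n):
            s = 2.0 * sum(Q[r][i] * v[i - k] for i in range(k, n)) / vv
            for i in range(k, n):
                Q[r][i] -= s * v[i - k]
    return Q, R

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
Q, R = householder_qr(A)
QR = [[sum(Q[i][m] * R[m][j] for m in range(3)) for j in range(2)] for i in range(3)]
```

Entries of `R` below the diagonal are zero only up to rounding errors; a production code would set them to zero explicitly and would usually store the vectors $v$ instead of accumulating $Q$.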


Remark 7.3.3. The QR factorization by the Householder algorithm is possible even if the matrix $A$ is singular. This is an advantage over the Gram-Schmidt algorithm, where column permutations are required if $A$ is singular. The Householder algorithm is numerically stable, so it is the one used in practice. The algorithm still works if $n < p$ with obvious modifications.


Operation Count 


At each step $k$, the vector $a^k + \|a^k\|e_1$ is computed as well as the product of the matrices $H^k$ and $A^k$. Due to the special shape of the matrix $H^k$, this matrix product is equivalent to running $(p+1-k)$ operations of the type

$$\frac{(vv^t)a}{\|v\|^2} = \frac{\langle v, a\rangle v}{\|v\|^2},$$

where $v = a^k + \|a^k\|e_1$, and $a$ is successively the vector containing the last $(n+1-k)$ entries of the last $(p+1-k)$ column vectors of $A^k$. Hence, there are mainly $(p+1-k)$ scalar products and vector-scalar multiplications (we neglect all lower-order terms such as, for example, the computation of $\|v\|$). Finally, we obtain

$$N_{op} \approx \sum_{k=1}^{p} 2(p+1-k)(n+1-k) \approx np^2 - \frac{p^3}{3}.$$

This number of operations is smaller than that for the Gram-Schmidt procedure. In the case $n = p$, this method can be used to solve a linear system, and the number of operations is of order $\frac{2}{3}n^3$, which makes the Householder algorithm twice as slow as the Gauss elimination algorithm. We shall use Householder matrices again in Chapter 10, concerning eigenvalue computations.


7.4 Exercises 


7.1. Define a matrix A=reshape(1:28,7,4) and vectors b1=ones(7,1), and 
b2=[13;2;3;4;3;2;1]. 
1. Compute the solutions $x_1$ and $x_2$, of minimal norm, for the least squares problems

$$\min_{x \in \mathbb{R}^4} \|Ax - b_1\|_2 \quad \text{and} \quad \min_{x \in \mathbb{R}^4} \|Ax - b_2\|_2,$$

as well as the corresponding minimum values. Comment.
2. How can the other solutions of these two minimization problems be computed?


7.2. Define a matrix A by 
A=reshape(1:6,3,2);A=[A eye(size(A)); -eye(size(A)) -A];


1. Define a vector $b_0$ by b0=[2 4 3 -2 -4 -3]'. Compute the solution $x_0$, of minimal norm, of the least squares problem associated to the matrix $A$ and $b_0$. Let $b$ be a (small) variation of $b_0$ defined by e=1.E-2; b=b0+e*rand(6,1). Compute the solution $x$ associated with $b$. Compute the relative errors $\|x - x_0\|_2/\|x_0\|_2$ and $\|b - b_0\|_2/\|b_0\|_2$. Compute the amplification coefficient $C_b$ defined by equality (7.9).

2. Same questions for the vector $b_1$ defined by b1=[3 0 -2 -3 0 2]'. Display both vectors $x_1$ and $x$. What do you observe?

3. For $i$ varying from 1/100 to 1 by a step size of 1/100, compute the amplification coefficient $C_b(i)$ associated with the vector $b_2 = ib_0 + (1 - i)b_1$. Plot the results on a graph. Comment.


7.3. The goal of this exercise is to approximate a smooth function $f$ defined on the interval $(0,1)$ by a polynomial $p \in \mathbb{P}_{n-1}$ in the least squares sense, i.e.,

$$\int_0^1 |f(x) - p(x)|^2\,dx = \min_{q \in \mathbb{P}_{n-1}} \int_0^1 |f(x) - q(x)|^2\,dx. \qquad (7.18)$$

Writing $p = \sum_{i=1}^{n} a_i\varphi_i$ in a basis $(\varphi_i)_{i=1}^{n}$ of $\mathbb{P}_{n-1}$, the unknowns of the problem are thus the coefficients $a_i$. Problem (7.18) is equivalent to determining the minimum of the function $E$ from $\mathbb{R}^n$ into $\mathbb{R}$, defined by

$$E(a_1, \dots, a_n) = \int_0^1 \Big| f(x) - \sum_{i=1}^{n} a_i\varphi_i(x) \Big|^2 dx,$$

which admits a unique solution $(a_1, \dots, a_n)$ characterized by the relations

$$\frac{\partial E}{\partial a_k}(a_1, \dots, a_n) = 0, \quad k = 1, \dots, n.$$

1. Show that the vector $a = (a_1, \dots, a_n)^t$ is the solution of a linear system $Aa = b$ whose matrix and right-hand side are to be specified.
2. Take $\varphi_i(x) = x^{i-1}$ and show that the matrix $A$ is the Hilbert matrix; see Exercise 2.2.
3. Prove, using the Gram-Schmidt procedure, that there exists a basis $(\varphi_i)_{i=1}^{n}$ of $\mathbb{P}_{n-1}$ such that

$$\int_0^1 \varphi_i(x)\varphi_j(x)\,dx = \delta_{i,j}, \quad i, j \in \{1, \dots, n\}.$$

What is the point in using this basis to compute $p$?


7.4. Define a matrix A by A=MatRank (300,100,100) (see Exercise 2.7), and 
b=rand(300,1). 


1. Compare (in terms of computational time) the following three methods for solving the least squares problem

$$\min_{x} \|Ax - b\|_2. \qquad (7.19)$$

(a) The Cholesky method for solving the normal equations. Use the function chol.

(b) The QR factorization method.

(c) The SVD method where $A$ is factorized as $A = V\tilde{\Sigma}U^*$ (see Theorem 2.7.1). Hint: the solutions of (7.19) are given by $x = Uy$, where $y \in \mathbb{R}^n$ can be determined explicitly in terms of the singular values of $A$ and the vector $b$.

Compute the solutions $x$ of (7.19), the minima $\|Ax - b\|$, and the computational time. Conclude.

2. Now define $A$ and $b$ by

e=1.e-5;P=[1 1 0;0 1 -1; 1 0 -1]
A=P*diag([e,1,1/e])*inv(P); b=ones(3,1)

Compare the solutions of the least squares problem obtained by the Cholesky method and the QR method. Explain.


7.5 (*). Program a function Householder to execute the QR factorization of 
a matrix by the Householder algorithm. Compare the results with the factor- 
ization obtained by the Gram-Schmidt method. 


8 


Simple Iterative Methods 


8.1 General Setting 


This chapter is devoted to solving the linear system

$$Ax = b$$

by means of iterative methods. In the above equation, $A \in M_n(\mathbb{R})$ is a nonsingular square matrix, $b \in \mathbb{R}^n$ is the right-hand side, and $x$ is the unknown vector. A method for solving the linear system $Ax = b$ is called iterative if it is a numerical method computing a sequence of approximate solutions $x_k$ that converges to the exact solution $x$ as the number of iterations $k$ goes to $+\infty$.

In this chapter, we consider only iterative methods whose sequence of approximate solutions is defined by a simple induction relation, that is, $x_{k+1}$ is a function of $x_k$ only and not of the previous iterations $x_{k-1}, \dots, x_0$.


Definition 8.1.1. Let $A$ be a nonsingular matrix. A pair of matrices $(M, N)$ with $M$ nonsingular (and easily invertible in practice) satisfying

$$A = M - N$$

is called a splitting (or regular decomposition) of $A$. An iterative method based on the splitting $(M, N)$ is defined by

$$\begin{cases} x_0 \text{ given in } \mathbb{R}^n, \\ Mx_{k+1} = Nx_k + b \quad \forall k \ge 0. \end{cases} \qquad (8.1)$$

In the iterative method (8.1), the task of solving the linear system $Ax = b$ is replaced by a sequence of several linear systems $M\tilde{x} = \tilde{b}$ to be solved. Therefore, $M$ has to be much easier to invert than $A$.


Remark 8.1.1. If the sequence of approximate solutions $x_k$ converges to a limit $x$ as $k$ tends to infinity, then by taking the limit in the induction relation (8.1) we obtain

$$(M - N)x = Ax = b.$$

Accordingly, should the sequence of approximate solutions converge, its limit is necessarily the solution of the linear system.

From a practical viewpoint, a convergence criterion is required to decide when to terminate the iterations, that is, when $x_k$ is sufficiently close to the unknown solution $x$. We will address this issue at the end of the chapter.


Definition 8.1.2. An iterative method is said to converge if for any choice of the initial vector $x_0 \in \mathbb{R}^n$, the sequence of approximate solutions $x_k$ converges to the exact solution $x$.

Definition 8.1.3. We call the vector $r_k = b - Ax_k$ (respectively, $e_k = x_k - x$) the residual (respectively, error) at the $k$th iteration.

Obviously, an iterative method converges if and only if $e_k$ converges to 0, which is equivalent to $r_k = -Ae_k$ converging to 0. In general, we have no knowledge of $e_k$ because $x$ is unknown! However, it is easy to compute the residuals $r_k$, so convergence is detected on the residual in practice.

The sequence defined by (8.1) is also equivalently given by

$$x_{k+1} = M^{-1}Nx_k + M^{-1}b. \qquad (8.2)$$

The matrix $M^{-1}N$ is called an iteration matrix or amplification matrix of the iterative method. Theorem 8.1.1 below shows that the convergence of the iterative method is linked to the spectral radius of $M^{-1}N$.


Theorem 8.1.1. The iterative method defined by (8.1) converges if and only if the spectral radius of $M^{-1}N$ satisfies

$$\varrho(M^{-1}N) < 1.$$

Proof. The error $e_k$ is given by the induction relation

$$e_k = x_k - x = (M^{-1}Nx_{k-1} + M^{-1}b) - (M^{-1}Nx + M^{-1}b) = M^{-1}N(x_{k-1} - x) = M^{-1}Ne_{k-1}.$$

Hence $e_k = (M^{-1}N)^ke_0$, and by Lemma 3.3.1 we infer that $\lim_{k \to +\infty} e_k = 0$, for any $e_0$, if and only if $\varrho(M^{-1}N) < 1$.


Example 8.1.1. To solve a linear system $Ax = b$, we consider Richardson's iterative method (also called gradient method)

$$x_{k+1} = x_k + \alpha(b - Ax_k),$$

where $\alpha$ is a real number. It corresponds to the splitting (8.1) with $M = \alpha^{-1}I$ and $N = \alpha^{-1}I - A$. The eigenvalues of the iteration matrix $B_\alpha = I - \alpha A$ are $(1 - \alpha\lambda_i)$, where $(\lambda_i)_i$ are the eigenvalues of $A$. Richardson's method converges if and only if $|1 - \alpha\lambda_i| < 1$ for any eigenvalue $\lambda_i$. If the eigenvalues of $A$ satisfy $0 < \lambda_1 \le \cdots \le \lambda_n = \varrho(A)$, the latter condition is equivalent to $\alpha \in (0, 2/\varrho(A))$; see Figure 9.1.
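The convergence condition of Example 8.1.1 can be observed on a small system. In the Python sketch below (the function name `richardson`, the $2 \times 2$ matrix, and the two values of $\alpha$ are assumptions chosen for illustration), the eigenvalues of $A$ are $(5 \pm \sqrt{5})/2$, so $2/\varrho(A) \approx 0.553$.

```python
def richardson(A, b, alpha, x0, iters):
    """Iterate x_{k+1} = x_k + alpha (b - A x_k) for a fixed number of steps."""
    n = len(b)
    x = x0[:]
    for _ in range(iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + alpha * r[i] for i in range(n)]
    return x

A = [[2.0, 1.0], [1.0, 3.0]]   # symmetric, eigenvalues (5 +/- sqrt(5))/2, both positive
b = [3.0, 5.0]                 # exact solution x = (0.8, 1.4)
x_good = richardson(A, b, 0.4, [0.0, 0.0], 200)   # alpha inside (0, 2/rho(A))
x_bad = richardson(A, b, 0.6, [0.0, 0.0], 200)    # alpha outside: the iterates blow up
```

With $\alpha = 0.4$ both factors $|1 - \alpha\lambda_i|$ are about 0.45, whereas with $\alpha = 0.6$ the factor attached to the largest eigenvalue exceeds 1 and the error is amplified at every step.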
In some cases, it is not necessary to compute the spectral radius of $M^{-1}N$ to prove convergence, as shown in the following theorem.

Theorem 8.1.2. Let $A$ be a Hermitian positive definite matrix. Consider a splitting of $A = M - N$ with $M$ nonsingular. Then the matrix $(M^* + N)$ is Hermitian. Furthermore, if $(M^* + N)$ is also positive definite, we have

$$\varrho(M^{-1}N) < 1.$$


Proof. First of all, $M^* + N$ is indeed Hermitian since it is the sum of two Hermitian matrices

$$M^* + N = (M^* - N^*) + (N^* + N) = A^* + (N^* + N).$$

Since $A$ is positive definite, we define the following vector norm:

$$|x|_A = \sqrt{\langle Ax, x\rangle}, \quad \forall x \in \mathbb{R}^n.$$

We denote by $\|\cdot\|$ the matrix norm subordinate to $|\cdot|_A$. Let us show that $\|M^{-1}N\| < 1$, which yields the desired result thanks to Proposition 3.1.4. By Proposition 3.1.1, there exists $v$ depending on $M^{-1}N$ such that $|v|_A = 1$ and satisfying

$$\|M^{-1}N\|^2 = \max_{|x|_A = 1} |M^{-1}Nx|_A^2 = |M^{-1}Nv|_A^2.$$

Since $N = M - A$, setting $w = M^{-1}Av$, we get

$$\begin{aligned} |M^{-1}Nv|_A^2 &= \langle AM^{-1}Nv, M^{-1}Nv\rangle = \langle AM^{-1}(M - A)v, M^{-1}(M - A)v\rangle \\ &= \langle Av - AM^{-1}Av, (I - M^{-1}A)v\rangle \\ &= \langle Av, v\rangle - \langle AM^{-1}Av, v\rangle + \langle AM^{-1}Av, M^{-1}Av\rangle - \langle Av, M^{-1}Av\rangle \\ &= 1 - \langle w, Mw\rangle + \langle Aw, w\rangle - \langle Mw, w\rangle = 1 - \langle (M^* + N)w, w\rangle. \end{aligned}$$

By assumption, $(M^* + N)$ is positive definite and $w \neq 0$, since $A$ and $M$ are nonsingular. Thus $\langle (M^* + N)w, w\rangle > 0$. As a result, $\|M^{-1}N\|^2 = 1 - \langle (M^* + N)w, w\rangle < 1$.


Iterative methods for solving linear systems may require a large number of 
iterations to converge. Thus, one might think that the accumulation of round- 
ing errors during the iterations completely destroys the convergence of these 
methods on computers (or even worse, makes them converge to wrong solu- 
tions). Fortunately enough, this is not the case, as is shown by the following 
result. 


Theorem 8.1.3. Consider a splitting of $A = M - N$ with $A$ and $M$ nonsingular. Let $b \in \mathbb{R}^n$ be the right-hand side, and let $x \in \mathbb{R}^n$ be the solution of $Ax = b$. We assume that at each step $k$ the iterative method is tainted by an error $\varepsilon_k \in \mathbb{R}^n$, meaning that $x_{k+1}$ is not exactly given by (8.1) but rather by

$$x_{k+1} = M^{-1}Nx_k + M^{-1}b + \varepsilon_k.$$

We assume that $\varrho(M^{-1}N) < 1$, and that there exist a vector norm and a positive constant $\varepsilon$ such that for all $k \ge 0$,

$$\|\varepsilon_k\| \le \varepsilon.$$

Then, there exists a constant $K$, which depends on $M^{-1}N$ but not on $\varepsilon$, such that

$$\limsup_{k \to +\infty} \|x_k - x\| \le K\varepsilon.$$

Proof. The error $e_k = x_k - x$ satisfies $e_{k+1} = M^{-1}Ne_k + \varepsilon_k$, so that

$$e_k = (M^{-1}N)^ke_0 + \sum_{i=0}^{k-1} (M^{-1}N)^i\varepsilon_{k-i-1}. \qquad (8.3)$$

By virtue of Proposition 3.1.4, there exists a subordinate matrix norm $\|\cdot\|_s$ such that $\|M^{-1}N\|_s < 1$, since $\varrho(M^{-1}N) < 1$. We use the same notation for the associated vector norm. Now, all vector norms on $\mathbb{R}^n$ are equivalent, so there exists a constant $C \ge 1$, which depends only on $M^{-1}N$, such that

$$C^{-1}\|y\| \le \|y\|_s \le C\|y\|, \quad \forall y \in \mathbb{R}^n.$$

Bounding (8.3) from above yields

$$\|e_k\|_s \le \|M^{-1}N\|_s^k\|e_0\|_s + \sum_{i=0}^{k-1} \|M^{-1}N\|_s^i\,C\varepsilon \le \|M^{-1}N\|_s^k\|e_0\|_s + \frac{C\varepsilon}{1 - \|M^{-1}N\|_s}.$$

Letting $k$ go to infinity leads to the desired result with $K = C^2/(1 - \|M^{-1}N\|_s)$.


Iterative methods are often used with sparse matrices. A matrix is said to be sparse if it has relatively few nonzero entries. Sparse matrices arise, for example, in the discretization of partial differential equations by the finite difference or finite element method. A simple instance is given by tridiagonal matrices.

Storage of sparse matrices. The idea is to keep track only of nonzero entries of a matrix $A$, thereby saving considerable memory in practice for matrices of large size. We introduce sparse or Morse storage through the following illustrative example; for more details we refer to [5], [14].


Example 8.1.2. Define a matrix

$$A = \begin{pmatrix} 9 & 0 & -3 & 0 \\ 7 & -1 & 0 & 4 \\ 0 & 5 & 2 & 0 \\ 1 & 0 & -1 & 2 \end{pmatrix}. \qquad (8.4)$$

The entries of $A$ are stored in a vector array STOCKA. We define another array BEGINL, which indicates where the rows of $A$ are stored in STOCKA. More precisely, STOCKA(BEGINL($i$)) contains the first nonzero entry of row $i$. We also need a third array INDICC that gives the column of every entry stored in STOCKA. If $a_{i,j}$ is stored in STOCKA($k$), then INDICC($k$) $= j$. The number of nonzero entries of $A$ is equal to the size of the vectors INDICC and STOCKA. The vector BEGINL has size $(n+1)$, where $n$ is the number of rows of $A$, because its last entry BEGINL($n+1$) is equal to the size of INDICC and STOCKA plus 1. This is useful in computing the product $z = Ay$ with such a storage: each entry $z(i)$ of $z$ is given by

$$z(i) = \sum_{k=k_i}^{k_{i+1}-1} \mathrm{STOCKA}(k)\,y(\mathrm{INDICC}(k)),$$

where $k_i = $ BEGINL($i$) and $(k_{n+1} - 1)$ is precisely the size of INDICC and STOCKA (see Table 8.1 for its application to $A$).


  k   STOCKA   INDICC   BEGINL
  1      9        1        1
  2     -3        3        3
  3      7        1        6
  4     -1        2        8
  5      4        4       11
  6      5        2
  7      2        3
  8      1        1
  9     -1        3
 10      2        4

Table 8.1. Morse storage of the matrix A in (8.4).
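The product formula for $z = Ay$ translates directly into code. The Python sketch below uses the three arrays of Table 8.1 shifted to 0-based indexing (the shift and the function name `morse_matvec` are assumptions of this illustration; the array names are those of the text).

```python
# Arrays of Table 8.1, with column indices and row pointers shifted by -1
# because Python lists are indexed from 0.
STOCKA = [9.0, -3.0, 7.0, -1.0, 4.0, 5.0, 2.0, 1.0, -1.0, 2.0]
INDICC = [0, 2, 0, 1, 3, 1, 2, 0, 2, 3]   # column of each stored entry
BEGINL = [0, 2, 5, 7, 10]                 # start of each row; last entry = number of entries

def morse_matvec(stocka, indicc, beginl, y):
    """Compute z = A y, visiting only the stored (nonzero) entries of A."""
    n = len(beginl) - 1
    return [sum(stocka[k] * y[indicc[k]] for k in range(beginl[i], beginl[i + 1]))
            for i in range(n)]

z = morse_matvec(STOCKA, INDICC, BEGINL, [1.0, 2.0, 3.0, 4.0])
```

The extra last entry of BEGINL lets every row, including the last one, be handled by the same loop bounds, and the cost of the product is proportional to the number of stored entries rather than to $n^2$.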


8.2 Jacobi, Gauss-Seidel, and Relaxation Methods 


8.2.1 Jacobi Method 
For any matrix $A = (a_{i,j})_{1 \le i,j \le n}$, its diagonal $D$ is defined as

$$D = \mathrm{diag}(a_{1,1}, \dots, a_{n,n}).$$

Definition 8.2.1. The Jacobi method is the iterative method defined by the splitting

$$M = D, \quad N = D - A.$$

The iteration matrix of this method is denoted by $J = M^{-1}N = I - D^{-1}A$.

Remark 8.2.1.
1. The Jacobi method is well defined if the diagonal matrix is nonsingular. 
2. If A is Hermitian, the Jacobi method converges if A and 2D — A are 
positive definite (by virtue of Theorem 8.1.2). 
3. There exists a block Jacobi method; see Section 8.6. 


8.2.2 Gauss-Seidel Method 


For any matrix $A = (a_{i,j})_{1 \le i,j \le n}$, consider the decomposition $A = D - E - F$, where $D$ is the diagonal, $-E$ is the lower part, and $-F$ is the upper part of $A$. Namely,

$$\begin{cases} d_{i,j} = a_{i,j}\delta_{i,j}; \\ e_{i,j} = -a_{i,j} \text{ if } i > j, \text{ and } 0 \text{ otherwise}; \\ f_{i,j} = -a_{i,j} \text{ if } i < j, \text{ and } 0 \text{ otherwise}. \end{cases}$$

Definition 8.2.2. The Gauss-Seidel method is the iterative method defined by the splitting

$$M = D - E, \quad N = F.$$

The iteration matrix of this method is denoted by $G_1 = M^{-1}N = (D - E)^{-1}F$.


Remark 8.2.2. 

1. The Gauss-Seidel method is well defined if the matrix D — E is nonsin- 
gular, which is equivalent to asking that D be nonsingular. 

2. The matrix (D — E) is easy to invert, since it is triangular. 

3. If A is Hermitian and positive definite, then M*+N = D is also Hermitian 
and positive definite, so Gauss-Seidel converges (by virtue of Theorem 
8.1.2). 

4. There exists a block Gauss-Seidel method; see Section 8.6. 


Comparison between the Jacobi method and the Gauss-Seidel method. 


In the Jacobi method, we successively compute the entries of $x^{k+1}$ in terms of all the entries of $x^k$:

$$x_i^{k+1} = \frac{1}{a_{i,i}}\left[-a_{i,1}x_1^k - \cdots - a_{i,i-1}x_{i-1}^k - a_{i,i+1}x_{i+1}^k - \cdots - a_{i,n}x_n^k + b_i\right].$$

In the Gauss-Seidel method, we use the information already computed in the first $(i-1)$ entries. Namely,

$$x_i^{k+1} = \frac{1}{a_{i,i}}\left[-a_{i,1}x_1^{k+1} - \cdots - a_{i,i-1}x_{i-1}^{k+1} - a_{i,i+1}x_{i+1}^k - \cdots - a_{i,n}x_n^k + b_i\right].$$

From a practical point of view, two vectors of size $n$ are required to store $x^k$ and $x^{k+1}$ separately in the Jacobi method, while only one vector is required in the Gauss-Seidel method (the entries of $x^{k+1}$ progressively overwrite those of $x^k$).
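The storage remark is visible in code: a Jacobi sweep must build a new vector, while a Gauss-Seidel sweep can update its argument in place. The Python sketch below (function names and the $3 \times 3$ test matrix are assumptions made for illustration) runs 20 sweeps of each method on a tridiagonal system whose exact solution is $(1, 1, 1)$.

```python
def jacobi_sweep(A, b, x):
    """One Jacobi sweep: every new entry is computed from the old vector x."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep: x is overwritten entry by entry, so entries
    with index j < i are already the new ones when entry i is updated."""
    n = len(b)
    for i in range(n):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x

def error(x, exact):
    return max(abs(u - v) for u, v in zip(x, exact))

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
exact = [1.0, 1.0, 1.0]
b = [1.0, 0.0, 1.0]                      # b = A * exact
xj, xg = [0.0] * 3, [0.0] * 3
for _ in range(20):
    xj = jacobi_sweep(A, b, xj)
    xg = gauss_seidel_sweep(A, b, xg)
err_jacobi, err_gs = error(xj, exact), error(xg, exact)
```

On this example the Gauss-Seidel error after 20 sweeps is markedly smaller than the Jacobi error, although both methods perform the same number of multiplications per sweep.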
8.2.3 Successive Overrelaxation Method (SOR)

The successive overrelaxation method (SOR) can be seen as an extrapolation of the Gauss-Seidel method.

Definition 8.2.3. Let $\omega \in \mathbb{R}^+$. The iterative method relative to the splitting

$$M = \frac{D}{\omega} - E, \quad N = \frac{1 - \omega}{\omega}D + F$$

is called relaxation method for the parameter $\omega$. We denote by $G_\omega$ the iteration matrix

$$G_\omega = M^{-1}N = \left(\frac{D}{\omega} - E\right)^{-1}\left(\frac{1 - \omega}{\omega}D + F\right).$$

Remark 8.2.3.

1. The relaxation method is well defined if the diagonal $D$ is invertible.

2. If $\omega = 1$, we recover the Gauss-Seidel method.

3. If $\omega < 1$, we talk about an under-relaxation method.

4. If $\omega > 1$, we talk about an over-relaxation method.

5. The idea behind the relaxation method is the following. If the efficiency of an iterative method is measured by the spectral radius of its iteration matrix $M^{-1}N$, then, since $\varrho(G_\omega)$ is continuous with respect to $\omega$, we can find an optimal $\omega$ that produces the smallest spectral radius possible. Accordingly, the associated iterative method is more efficient than Gauss-Seidel. We shall see that in general, $\omega_{opt} > 1$, hence the name SOR (overrelaxation).

6. A block relaxation method is defined in Section 8.6.

7. A relaxation approach for the Jacobi method is discussed in Exercise 8.7.


Theorem 8.2.1. Let $A$ be a Hermitian positive definite matrix. Then for any $\omega \in\, ]0, 2[$, the relaxation method converges.

Proof. Since $A$ is positive definite, so is $D$. As a result, $\frac{D}{\omega} - E$ is nonsingular. Moreover,

$$M^* + N = \frac{D}{\omega} - E^* + \frac{1 - \omega}{\omega}D + F = \frac{2 - \omega}{\omega}D,$$

since $E^* = F$. We conclude that $M^* + N$ is positive definite if and only if $0 < \omega < 2$. Theorem 8.1.2 yields the result.

Theorem 8.2.2. For any matrix $A$, we always have

$$\varrho(G_\omega) \ge |1 - \omega|, \quad \forall\omega \neq 0.$$

Consequently, the relaxation method can converge only if $0 < \omega < 2$.


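Proof. The determinant of $G_\omega$ is equal to

$$\det(G_\omega) = \det\left(\frac{1 - \omega}{\omega}D + F\right) \Big/ \det\left(\frac{D}{\omega} - E\right) = (1 - \omega)^n.$$

We deduce that

$$\varrho(G_\omega)^n \ge \prod_{i=1}^{n} |\lambda_i(G_\omega)| = |\det(G_\omega)| = |1 - \omega|^n,$$

where $\lambda_i(G_\omega)$ are the eigenvalues of $G_\omega$. This yields the result.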
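Written entrywise, the splitting $M = D/\omega - E$, $N = \frac{1-\omega}{\omega}D + F$ amounts to blending each Gauss-Seidel update with the previous value of the entry. The Python sketch below (function names and the test system are assumptions for illustration) compares a parameter inside $]0, 2[$ with one outside on a symmetric positive definite tridiagonal matrix.

```python
def sor_sweep(A, b, x, omega):
    """One relaxation sweep, in place: the Gauss-Seidel value of entry i is
    combined with the old entry using the weights omega and (1 - omega)."""
    n = len(b)
    for i in range(n):
        gs = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        x[i] = (1.0 - omega) * x[i] + omega * gs
    return x

def sor_error(A, b, exact, omega, sweeps):
    """Max-norm error after a given number of sweeps started from x = 0."""
    x = [0.0] * len(b)
    for _ in range(sweeps):
        sor_sweep(A, b, x, omega)
    return max(abs(u - v) for u, v in zip(x, exact))

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
exact = [1.0, 1.0, 1.0]
err_inside = sor_error(A, b, exact, 1.2, 60)    # 0 < omega < 2: converges
err_outside = sor_error(A, b, exact, 2.5, 60)   # omega > 2: rho(G_omega) >= 1.5
```

With $\omega = 1$ the function reduces to a Gauss-Seidel sweep, in agreement with Remark 8.2.3.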


8.3 The Special Case of Tridiagonal Matrices 


We compare the Jacobi, Gauss-Seidel, and relaxation methods in the special 
case of tridiagonal matrices. 


Theorem 8.3.1. Let A be a tridiagonal matrix. We have 


(G1) = oT)’, 


so the Jacobi and Gauss-Seidel methods converge or diverge simultaneously, 
but Gauss-Seidel always converges faster than Jacobi. 


Theorem 8.3.2. Let A be a tridiagonal Hermitian positive definite matriz. 
Then all three methods converge. Moreover, there exists a unique optimal pa- 
rameter Wopt in the sense that 


OGuore) = olin, (Gu), 


where 
2 


Wopt = — a? 
14 V1- lay 
and 


O(Gurape) = Wopt — L; 


Remark 8.3.1. Theorem 8.3.2 shows that in the case of a tridiagonal Hermitian
positive definite matrix, we have $\omega_{opt} > 1$. Therefore, it is better to perform
overrelaxation than underrelaxation.
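
As a small illustration of the formulas of Theorem 8.3.2, the following Python sketch evaluates $\omega_{opt}$ and the three spectral radii from a given value of $\varrho(\mathcal{J})$. The function name `optimal_relaxation` and the sample value 0.9 are assumptions made for this example only; they do not come from the text.

```python
import math

def optimal_relaxation(rho_jacobi):
    """Return (omega_opt, rho(G_omega_opt)) given the Jacobi spectral radius.

    Uses omega_opt = 2 / (1 + sqrt(1 - rho(J)^2)) and rho(G_omega_opt) = omega_opt - 1.
    """
    if not 0.0 <= rho_jacobi < 1.0:
        raise ValueError("the formula requires 0 <= rho(J) < 1")
    omega_opt = 2.0 / (1.0 + math.sqrt(1.0 - rho_jacobi ** 2))
    return omega_opt, omega_opt - 1.0

rho_j = 0.9                      # hypothetical Jacobi spectral radius
omega_opt, rho_sor = optimal_relaxation(rho_j)
rho_gs = rho_j ** 2              # Gauss-Seidel radius, by Theorem 8.3.1
print(omega_opt, rho_j, rho_gs, rho_sor)
```

For this sample value the optimal parameter lies strictly between 1 and 2, and the three radii are ordered as the theory predicts: relaxation below Gauss-Seidel below Jacobi.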


To prove the above theorems, we need a technical lemma. 


Lemma 8.3.1. For any nonzero real number $\mu$, we define a tridiagonal
matrix $A(\mu)$ by
\[
A(\mu) = \begin{pmatrix}
b_1 & \mu^{-1} c_1 & & 0 \\
\mu a_2 & \ddots & \ddots & \\
 & \ddots & \ddots & \mu^{-1} c_{n-1} \\
0 & & \mu a_n & b_n
\end{pmatrix},
\]
where $a_i$, $b_i$, $c_i$ are given real numbers. The determinant of $A(\mu)$ is
independent of $\mu$. In particular, $\det A(\mu) = \det A(1)$.


Proof. The matrices $A(\mu)$ and $A(1)$ are similar, since $A(\mu)Q(\mu) = Q(\mu)A(1)$
with $Q(\mu) = \mathrm{diag}(\mu, \mu^2, \dots, \mu^n)$, which yields the result.
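
The independence stated in Lemma 8.3.1 can also be seen from the classical three-term recurrence for tridiagonal determinants, in which only the products of matching sub- and superdiagonal entries appear. The sketch below is one way to check this numerically; the function name `tridiag_det` and the sample coefficients are illustrative assumptions, and the lists are indexed from 0 rather than from 1 as in the lemma.

```python
def tridiag_det(a, b, c, mu=1.0):
    """Determinant of the tridiagonal matrix A(mu).

    b has length n (diagonal); a and c have length n - 1.
    Row k, column k-1 holds mu * a[k-1]; row k-1, column k holds c[k-1] / mu.
    Recurrence: d_k = b_k d_{k-1} - (sub)(sup) d_{k-2}.
    """
    d_prev, d = 1.0, b[0]
    for k in range(1, len(b)):
        sub = mu * a[k - 1]
        sup = c[k - 1] / mu
        d_prev, d = d, b[k] * d - sub * sup * d_prev
    return d

a, b, c = [1.0, 2.0], [4.0, 5.0, 6.0], [3.0, 1.0]   # hypothetical data
dets = [tridiag_det(a, b, c, mu) for mu in (1.0, 2.5, -0.3)]
print(dets)
```

Since the factor $\mu$ of each subdiagonal entry is cancelled by the factor $\mu^{-1}$ of the facing superdiagonal entry, the three values agree up to rounding.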


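Proof of Theorem 8.3.1. The eigenvalues of $A$ are the roots of its
characteristic polynomial $P_A(\lambda) = \det(A - \lambda I)$. We have
\[
P_{\mathcal{J}}(\lambda) = \det(-D^{-1}) \det(\lambda D - E - F)
\]
and
\[
P_{\mathcal{G}_1}(\lambda^2) = \det(E - D)^{-1} \det(\lambda^2 D - \lambda^2 E - F).
\]
We define a matrix $A(\mu)$ by
\[
A(\mu) = \lambda^2 D - \mu\lambda^2 E - \frac{1}{\mu} F .
\]
By Lemma 8.3.1, we get $\det A(1/\lambda) = \det A(1)$, that is,
$\lambda^n \det(\lambda D - E - F) = \det(\lambda^2 D - \lambda^2 E - F)$. Since $\det(E - D) = \det(-D)$, we obtain
\[
P_{\mathcal{G}_1}(\lambda^2) = \lambda^n P_{\mathcal{J}}(\lambda).
\]
As a consequence, for any $\lambda \ne 0$, we deduce that $\lambda$ is an eigenvalue of $\mathcal{J}$ if
and only if $\lambda^2$ is an eigenvalue of $\mathcal{G}_1$. Thus, $\varrho(\mathcal{G}_1) = \varrho(\mathcal{J})^2$.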


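Proof of Theorem 8.3.2. Since $A$ is Hermitian positive definite, we already
know by Theorem 8.2.1 that the relaxation method converges for $0 < \omega < 2$.
In particular, the Gauss-Seidel method converges. Now, Theorem 8.3.1 states
that $\varrho(\mathcal{J})^2 = \varrho(\mathcal{G}_1) < 1$. Therefore, the Jacobi method converges too. It
remains to determine $\omega_{opt}$. Let $A(\mu)$ be the matrix defined by
\[
A(\mu) = \frac{\lambda^2 + \omega - 1}{\omega} D - \mu\lambda^2 E - \frac{1}{\mu} F .
\]
By Lemma 8.3.1, we know that $\det A(1/\lambda) = \det A(1)$. Accordingly,
\[
\det\Big(\frac{\lambda^2 + \omega - 1}{\omega} D - \lambda^2 E - F\Big)
= \lambda^n \det\Big(\frac{\lambda^2 + \omega - 1}{\lambda\omega} D - E - F\Big).
\]
Observing that
\[
P_{\mathcal{G}_\omega}(\lambda^2) = \det\Big(E - \frac{D}{\omega}\Big)^{-1} \det\Big(\frac{\lambda^2 + \omega - 1}{\omega} D - \lambda^2 E - F\Big),
\]
we deduce that there exists a constant $c$ (independent of $\lambda$) such that
\[
P_{\mathcal{G}_\omega}(\lambda^2) = c\,\lambda^n P_{\mathcal{J}}\Big(\frac{\lambda^2 + \omega - 1}{\lambda\omega}\Big).
\]
In other words, for any $\lambda \ne 0$, $\lambda^2$ is an eigenvalue of $\mathcal{G}_\omega$ if and only if
$(\lambda^2 + \omega - 1)/(\lambda\omega)$ is an eigenvalue of $\mathcal{J}$. For an eigenvalue $\alpha$ of $\mathcal{J}$, we denote by
$\lambda^\pm(\alpha)$ the two roots of the following equation:
\[
\frac{\lambda^2 + \omega - 1}{\lambda\omega} = \alpha,
\]
that is,
\[
\lambda^\pm(\alpha) = \frac{\alpha\omega \pm \sqrt{\alpha^2\omega^2 - 4(\omega - 1)}}{2} .
\]
We have just proved that $\mu^+(\alpha) = \lambda^+(\alpha)^2$ and $\mu^-(\alpha) = \lambda^-(\alpha)^2$ are
eigenvalues of $\mathcal{G}_\omega$. Now, if $\alpha$ is an eigenvalue of $\mathcal{J}$, so is $-\alpha$. Then,
$\lambda^+(\alpha) = -\lambda^-(-\alpha)$. Hence,
\[
\varrho(\mathcal{G}_\omega) = \max_{\alpha \in \sigma(\mathcal{J})} |\mu^+(\alpha)|,
\]
with
\[
|\mu^+(\alpha)| = \Big| \frac{1}{2}(\alpha^2\omega^2 - 2\omega + 2) + \frac{\alpha\omega}{2} \sqrt{\alpha^2\omega^2 - 4(\omega - 1)} \Big| . \tag{8.5}
\]
In order to compute $\omega_{opt}$, we have to maximize (8.5) over all eigenvalues of
$\mathcal{J}$. Let us first show that the eigenvalues of $\mathcal{J}$ are real. Denote by $\alpha$ and $v \ne 0$
an eigenvalue and its corresponding eigenvector of $\mathcal{J}$. By definition,
\[
\mathcal{J}v = \alpha v \;\Leftrightarrow\; (E + F)v = \alpha Dv \;\Leftrightarrow\; Av = (1 - \alpha)Dv .
\]
Taking the scalar product with $v$ yields
\[
\langle Av, v\rangle = (1 - \alpha)\langle Dv, v\rangle,
\]
which implies that $(1 - \alpha)$ is a positive real number, since $A$ and $D$ are
Hermitian positive definite. The next step amounts to computing explicitly
$|\mu^+(\alpha)|$. Note that $\mu^+(\alpha)$ may be complex, because the polynomial
$\alpha^2\omega^2 - 4\omega + 4$ in the variable $\omega$ has two roots,
\[
\omega^\pm(\alpha) = \frac{2}{1 \pm \sqrt{1 - \alpha^2}},
\]
and may therefore be negative. Since $|\alpha| \le \varrho(\mathcal{J}) < 1$, we get
\[
1 < \omega^+(\alpha) < 2 < \omega^-(\alpha).
\]
If $\omega^+(\alpha) < \omega < 2$, then $\mu^+(\alpha)$ is complex, and a simple computation shows
that $|\mu^+(\alpha)| = |\omega - 1| = \omega - 1$. Otherwise, since $\omega \in (0,2)$, we have
$0 < \omega < \omega^+(\alpha)$, and $\mu^+(\alpha)$ is real. Thus
\[
|\mu^+(\alpha)| = \begin{cases}
\omega - 1 & \text{if } \omega^+(\alpha) < \omega < 2, \\
\lambda^+(\alpha)^2 & \text{if } 0 < \omega < \omega^+(\alpha).
\end{cases}
\]
When $\mu^+(\alpha)$ is real, we have $\lambda^+(\alpha) \ge \lambda^+(-\alpha)$ if $\alpha > 0$. Furthermore,
$\omega^+(\alpha) = \omega^+(-\alpha)$, so
\[
|\mu^+(\alpha)| \ge |\mu^+(-\alpha)|, \quad \text{if } \alpha > 0.
\]
In other words, we can restrict the maximization of $|\mu^+(\alpha)|$ to positive $\alpha$.
Moreover, for $\alpha > 0$ we have
\[
\frac{d}{d\alpha}\big(\lambda^+(\alpha)^2\big) = \lambda^+(\alpha) \Big( \omega + \frac{\alpha\omega^2}{\sqrt{\alpha^2\omega^2 - 4(\omega - 1)}} \Big) > 0.
\]
Accordingly, for a fixed $\omega$, the maximum is attained at $\alpha = \varrho(\mathcal{J})$:
\[
\varrho(\mathcal{G}_\omega) = |\mu^+(\varrho(\mathcal{J}))| = \max_{\alpha \in \sigma(\mathcal{J})} |\mu^+(\alpha)|.
\]
From now on, we replace $\alpha$ by the maximizer $\varrho(\mathcal{J})$ and we eventually minimize
with respect to $\omega$. The derivative is
\[
\begin{aligned}
\frac{d}{d\omega}\big(\lambda^+(\varrho(\mathcal{J}))^2\big)
&= 2\lambda^+(\varrho(\mathcal{J})) \frac{d}{d\omega}\lambda^+(\varrho(\mathcal{J})) \\
&= 2\lambda^+(\varrho(\mathcal{J})) \Big( \frac{\varrho(\mathcal{J})}{2} + \frac{2\varrho(\mathcal{J})^2\omega - 4}{4\sqrt{\varrho(\mathcal{J})^2\omega^2 - 4(\omega - 1)}} \Big) \\
&= 2\lambda^+(\varrho(\mathcal{J})) \frac{\varrho(\mathcal{J})\lambda^+(\varrho(\mathcal{J})) - 1}{\sqrt{\varrho(\mathcal{J})^2\omega^2 - 4(\omega - 1)}} .
\end{aligned}
\]
Since $0 < \varrho(\mathcal{J}) < 1$ and $\lambda^+(\varrho(\mathcal{J}))^2 \le \varrho(\mathcal{G}_\omega) < 1$, we deduce
\[
\frac{d}{d\omega}\big(\lambda^+(\varrho(\mathcal{J}))^2\big) < 0,
\]
and the minimum of $\lambda^+(\varrho(\mathcal{J}))^2$ on $[0, \omega^+(\varrho(\mathcal{J}))]$ is attained at $\omega^+(\varrho(\mathcal{J}))$.
Likewise, the minimum of $\omega - 1$ on $[\omega^+(\varrho(\mathcal{J})), 2]$ is attained at $\omega^+(\varrho(\mathcal{J}))$. We
deduce that as $\omega$ varies in $]0,2[$, the minimum of $\varrho(\mathcal{G}_\omega)$ is attained at $\omega^+(\varrho(\mathcal{J}))$,
and we obtain (see Figure 8.1)
\[
\min_{0<\omega<2} \varrho(\mathcal{G}_\omega) = \omega_{opt} - 1, \quad \text{and} \quad \omega_{opt} = \omega^+(\varrho(\mathcal{J})).
\]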


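Remark 8.3.2. If only a rough approximation of the optimal parameter $\omega_{opt}$
is available, it is better to overestimate it than to underestimate it, since (see
Figure 8.1)
\[
\lim_{\omega \to \omega_{opt}^+} \frac{d\varrho(\mathcal{G}_\omega)}{d\omega} = 1
\quad \text{and} \quad
\lim_{\omega \to \omega_{opt}^-} \frac{d\varrho(\mathcal{G}_\omega)}{d\omega} = -\infty .
\]


Fig. 8.1. Spectral radius of $\mathcal{G}_\omega$ in terms of $\omega$.

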
8.4 Discrete Laplacian 


We revisit the finite difference discretization of the Laplacian (see Sections
1.1 and 5.3.3), which leads to the linear system $A_n x = b$, with a tridiagonal
symmetric positive definite matrix $A_n \in \mathcal{M}_{n-1}(\mathbb{R})$, defined by (5.12), and
$b \in \mathbb{R}^{n-1}$. The eigenvalues of $A_n$ are (see Section 5.3.3)
\[
\lambda_k = 4n^2 \sin^2\Big(k\frac{\pi}{2n}\Big), \quad k = 1, \dots, n-1 .
\]
We now compare some iterative methods for solving the corresponding linear
system. As usual, $(x_k)_k$ denotes the sequence of vector iterates and $e_k = x_k - x$
is the error.


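• Jacobi method. According to our notations we have
\[
M = 2n^2 I_{n-1}, \quad N = 2n^2 I_{n-1} - A_n, \quad \mathcal{J} = M^{-1}N = I_{n-1} - \frac{1}{2n^2} A_n .
\]
  The eigenvalues of the Jacobi matrix $\mathcal{J}$ are therefore $\mu_k = 1 - \lambda_k/(2n^2)$,
  with $\lambda_k$ an eigenvalue of $A_n$, and
\[
\varrho(\mathcal{J}) = \max_{1 \le k \le n-1} \Big|\cos\Big(k\frac{\pi}{n}\Big)\Big| = \max_{k \in \{1, n-1\}} \Big|\cos\Big(k\frac{\pi}{n}\Big)\Big| = \cos\frac{\pi}{n} .
\]
  Since $\varrho(\mathcal{J}) < 1$, the Jacobi method converges, and as $n \to +\infty$,
\[
\varrho(\mathcal{J}) = 1 - \frac{\pi^2}{2n^2} + O(n^{-4}).
\]
  The matrix $\mathcal{J} = I_{n-1} - \frac{1}{2n^2} A_n$ is symmetric, and therefore normal. Thus,
  $\|\mathcal{J}^k\|_2 = \varrho(\mathcal{J})^k$ and $\|e_k\|_2 \le \varrho(\mathcal{J})^k \|e_0\|_2$.

  Let $\varepsilon$ be a given error tolerance. What is the minimal number of iterations
  $k_0$ such that the error after $k_0$ iterations is reduced by a factor $\varepsilon$? In
  mathematical terms we want $\|e_k\|_2 \le \varepsilon \|e_0\|_2$ for $k \ge k_0$. We compute an
  upper bound by looking for $k_0$ such that $\varrho(\mathcal{J})^{k_0} \le \varepsilon$. We find
\[
k_0 = \frac{\ln \varepsilon}{\ln \varrho(\mathcal{J})} \approx C n^2, \quad \text{with } C = -2\frac{\ln \varepsilon}{\pi^2} .
\]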

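• Gauss-Seidel method. Since $A_n$ is tridiagonal, we have
\[
\varrho(\mathcal{G}_1) = \varrho(\mathcal{J})^2 = \cos^2\frac{\pi}{n} < 1,
\]
  and the Gauss-Seidel method converges too. As $n \to +\infty$,
\[
\varrho(\mathcal{G}_1) = 1 - \frac{\pi^2}{n^2} + O(n^{-4}),
\]
  and the same reasoning as for the Jacobi method gives the number of iterations
\[
k_1 = \frac{\ln \varepsilon}{\ln \varrho(\mathcal{G}_1)} \approx C n^2, \quad \text{with } C = -\frac{\ln \varepsilon}{\pi^2} .
\]
  Note that $k_1 \approx k_0/2$. Put differently, for large values of $n$, the
  Gauss-Seidel method takes half as many iterations as the Jacobi method to meet
  some prescribed convergence criterion.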

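• Relaxation method. Since $A_n$ is tridiagonal, symmetric, and positive
  definite, the relaxation method converges if and only if $\omega \in (0,2)$, and the
  optimal value of the parameter $\omega$ (see Theorem 8.3.2) is
\[
\omega_{opt} = \frac{2}{1 + \sqrt{1 - \varrho(\mathcal{J})^2}} = \frac{2}{1 + \sin\frac{\pi}{n}} .
\]
  As $n \to +\infty$, we have $\omega_{opt} = 2(1 - \frac{\pi}{n}) + O(n^{-2})$, and
\[
\varrho(\mathcal{G}_{\omega_{opt}}) = \omega_{opt} - 1 = 1 - \frac{2\pi}{n} + O(n^{-2}).
\]
  With the choice $\omega = \omega_{opt}$, we get $\|e_k\|_2 \le \varepsilon \|e_0\|_2$ for $k \ge k_2$ satisfying
\[
k_2 = \frac{\ln \varepsilon}{\ln \varrho(\mathcal{G}_{\omega_{opt}})} \approx C n, \quad \text{with } C = -\frac{\ln \varepsilon}{2\pi} .
\]
  Note that $k_2 \approx \frac{\pi}{2n} k_1$, so the convergence is much faster than for Jacobi
  or Gauss-Seidel, since we save one order of magnitude in $n$.
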


In conclusion, to achieve a given fixed error $\varepsilon$, the Gauss-Seidel method
is (asymptotically) twice as fast as the Jacobi method. The speedup is all the
more considerable as we move from the Jacobi and Gauss-Seidel methods to
the relaxation method (with optimal parameter), since we save a factor $n$. For
instance, for $n = 100$ and $\varepsilon = 0.1$, we approximately obtain:


• $k_0 = 9342$ iterations for the Jacobi method,
• $k_1 = 4671$ iterations for the Gauss-Seidel method,
• $k_2 = 75$ iterations for the optimal relaxation method.


8.5 Programming Iterative Methods


We first define a convergence criterion for an iterative method, that is, a test
that, if satisfied at the $k$th iteration, allows us to conclude that $x_{k+1}$ is an
approximation of $x$ and to have control over the error $e_{k+1} = x_{k+1} - x$ made
at this approximation.


Convergence criterion 


Since we do not know $x$, we cannot decide to terminate the iterations as
soon as $\|x - x_k\| \le \varepsilon$, where $\varepsilon$ is the desired accuracy or tolerance. However,
we know $Ax$ (which is equal to $b$), and a simpler convergence criterion is
$\|b - Ax_k\| \le \varepsilon$. Nevertheless, if the norm of $A^{-1}$ is large, this criterion may
be misleading, since $\|x - x_k\| \le \|A^{-1}\| \, \|b - Ax_k\| \le \varepsilon \|A^{-1}\|$, which may not
be small. Therefore in practice, a relative criterion is preferred:
\[
\frac{\|b - Ax_k\|}{\|b - Ax_0\|} \le \varepsilon . \tag{8.6}
\]
Another simple (yet dangerous!) criterion that is sometimes used to detect
the convergence of $x_k$ is $\|x_{k+1} - x_k\| \le \varepsilon$. This criterion is dangerous because
it is a necessary, albeit not sufficient, condition for convergence (a notorious
counterexample is the scalar sequence $x_k = \sum_{i=1}^{k} 1/i$, which goes to infinity
although $|x_{k+1} - x_k|$ goes to zero).
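
The harmonic counterexample is easy to reproduce. The Python sketch below (the function name `first_small_increment` is an assumption of this example) returns the first index at which the increment criterion would stop, together with the partial sum reached at that point.

```python
def first_small_increment(eps):
    """Smallest k with |x_{k+1} - x_k| = 1/(k+1) <= eps, and the value x_k.

    Here x_k = 1 + 1/2 + ... + 1/k, with x_0 = 0 (empty sum).
    """
    k, x = 0, 0.0
    while 1.0 / (k + 1) > eps:
        k += 1
        x += 1.0 / k
    return k, x

k_stop, x_stop = first_small_increment(1e-3)
k_later, x_later = first_small_increment(1e-4)
print(k_stop, x_stop, k_later, x_later)
```

Tightening the tolerance only moves the stopping index further out while the returned value keeps growing, so the test $|x_{k+1} - x_k| \le \varepsilon$ never indicates how far the iterate is from a limit.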


Pseudolanguage algorithm


We recall that the simple iterative method $Mx_{k+1} = Nx_k + b$ can also be
written as follows:
\[
x_{k+1} = x_k + M^{-1} r_k, \tag{8.7}
\]
where $r_k = b - Ax_k$ is the residual at the $k$th iteration. Formula (8.7) is the
induction relation that we shall program.


• The initial guess is usually chosen as $x_0 = 0$, unless we have some
  information about the exact solution that we shall exploit by choosing $x_0$ close
  to $x$.

• At step 2 of the algorithm, the variable x is equal to $x_k$ as the input and
  to $x_{k+1}$ as the output. The same holds for r at step 3, which is equal to $r_k$ as
  the input and to $r_{k+1}$ as the output.

• Step 3 of the algorithm is based on the relation
\[
r_{k+1} = b - Ax_{k+1} = b - A(x_k + M^{-1} r_k) = r_k - AM^{-1} r_k .
\]

• The algorithm stops as soon as $\|r\|_2 \le \varepsilon \|b\|_2$, which is the relative criterion
  (8.6) for $x_0 = 0$.

• In practice, we cut down the number of iterations by adding to the
  condition of the "while" loop an additional condition $k \le k_{max}$, where $k_{max}$
  is the maximum authorized number of iterations. The iteration number $k$
  is incremented inside the loop.

Data: A, b. Output: x (approximation of $x$)
Initialization:
    choose $x \in \mathbb{R}^n$.
    compute $r = b - Ax$.
While $\|r\|_2 > \varepsilon \|b\|_2$
    1. Compute $y \in \mathbb{R}^n$, solution of
           $My = r$
    2. Update the solution
           $x = x + y$
    3. Compute the residual
           $r = r - Ay$
End While
Algorithm 8.1: Iterative method for a splitting $A = M - N$.
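
Algorithm 8.1 can be sketched in Python as follows. This is a minimal illustration using plain lists, not a reference implementation: the names `splitting_iteration`, `solve_jacobi`, and `solve_gauss_seidel` are assumptions of this example, the initial guess is fixed to $x_0 = 0$, a guard `k_max` on the number of iterations is added as suggested above, and the test matrix is a hypothetical tridiagonal one with stencil $(-1, 2, -1)$.

```python
import math

def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

def solve_jacobi(A, r):
    """Solve M y = r with M = D, the diagonal of A."""
    return [r_i / A[i][i] for i, r_i in enumerate(r)]

def solve_gauss_seidel(A, r):
    """Solve M y = r with M = D - E (lower triangle of A), by forward substitution."""
    y = []
    for i, r_i in enumerate(r):
        s = sum(A[i][j] * y[j] for j in range(i))
        y.append((r_i - s) / A[i][i])
    return y

def splitting_iteration(A, b, solve_m, eps=1e-8, k_max=10_000):
    """Algorithm 8.1 with x_0 = 0; returns (x, number of iterations)."""
    x = [0.0] * len(b)
    r = list(b)                                   # r_0 = b - A x_0 = b
    stop = eps * norm2(b)
    k = 0
    while norm2(r) > stop and k < k_max:
        y = solve_m(A, r)                         # step 1: M y = r
        x = [x_i + y_i for x_i, y_i in zip(x, y)] # step 2: x = x + y
        r = [r_i - t for r_i, t in zip(r, matvec(A, y))]  # step 3: r = r - A y
        k += 1
    return x, k

# Hypothetical test problem: tridiagonal matrix with 2 on the diagonal, -1 beside it.
n = 10
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x_exact = [1.0] * n
b = matvec(A, x_exact)
x_jac, k_jac = splitting_iteration(A, b, solve_jacobi)
x_gs, k_gs = splitting_iteration(A, b, solve_gauss_seidel)
print(k_jac, k_gs)
```

Only the routine that solves $My = r$ changes between the two methods, which is the point of writing the iteration in the residual form (8.7).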


Computational complexity. 


For iterative methods, we compute the number of operations per iteration. 


• Convergence criterion. It takes $n$ elementary operations since the test can
  be done on $\|r\|_2^2$ in order to avoid computing a square root.

• Step 1. Computing $y$ requires $n$ operations for the Jacobi method ($M$
  is diagonal), and $n^2/2$ operations for the Gauss-Seidel and relaxation
  methods ($M$ is triangular).

• Step 3. The product $Ay$ requires $n^2$ operations.


The number of operations per iteration is at most $3n^2$. This is very favorable
compared to direct methods if the total number of iterations is significantly
smaller than $n$.


8.6 Block Methods 


We can extend the Jacobi, Gauss-Seidel, and relaxation methods to the case
of block matrices. Figure 8.2 shows an example of block decomposition of a
matrix $A$. We always have $A = D - E - F$, where $D$ is a block diagonal matrix,
$-E$ is block lower triangular, and $-F$ is block upper triangular. Assume that
the size of the matrix is $n \times n$, and let $A_{i,j}$ ($1 \le i,j \le p$) be the blocks
constituting this matrix. Each block $A_{i,j}$ has size $n_i \times n_j$. In particular, each
diagonal block $A_{i,i}$ is square, of size $n_i \times n_i$ (note that $n = \sum_{i=1}^{p} n_i$). The block
decomposition of $A$ suggests the following decomposition, for any $b \in \mathbb{R}^n$:
\[
b = (b^1, \dots, b^p)^t, \quad b^i \in \mathbb{R}^{n_i}, \quad 1 \le i \le p .
\]
If the diagonal blocks are nonsingular, the Jacobi, Gauss-Seidel, and
relaxation methods are well defined.


Fig. 8.2. Example of a matrix decomposition by blocks.


Block-Jacobi method. 


An iteration of the Jacobi method reads
\[
Dx_{k+1} = (E + F)x_k + b . \tag{8.8}
\]
Writing the vectors $x_k \in \mathbb{R}^n$ as $x_k = (x_k^1, \dots, x_k^p)^t$, with $x_k^i \in \mathbb{R}^{n_i}$, $1 \le i \le p$,
equation (8.8) becomes
\[
A_{1,1} x_{k+1}^1 = b^1 - \sum_{j=2}^{p} A_{1,j} x_k^j, \tag{8.9}
\]
\[
A_{i,i} x_{k+1}^i = b^i - \sum_{j \ne i} A_{i,j} x_k^j, \quad \text{for } 2 \le i \le p-1, \tag{8.10}
\]
\[
A_{p,p} x_{k+1}^p = b^p - \sum_{j=1}^{p-1} A_{p,j} x_k^j . \tag{8.11}
\]
Since all diagonal blocks $A_{i,i}$ are nonsingular, we can compute each vector piece
$x_{k+1}^i$ (for $i = 1, \dots, p$), thereby determining $x_{k+1}$ completely. Since at each
iteration of the algorithm we have to invert the same matrices, it is wise to
compute once and for all the LU or Cholesky factorizations of the blocks $A_{i,i}$.

To count the number of operations required for one iteration, we assume
for simplicity that all blocks have the same size $n_i = n/p$. Computing $x_{k+1}$
from $x_k$ requires $p(p-1)$ block-vector multiplications, each having a cost of
$(n/p)^2$, and $p$ back-and-forth substitutions on the diagonal blocks, with cost
$(n/p)^2$ again. Each iteration has a total cost on the order of $n^2 + n^2/p$.

Likewise, we can define the block-Gauss-Seidel and block-relaxation
methods. Theorems 8.2.1, 8.3.1, and 8.3.2 apply to block methods as well (see [3]
for proofs).
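
One sweep of the block-Jacobi relations (8.9)-(8.11) can be sketched in Python as below. This is an illustration under simplifying assumptions: the helper names `solve_dense` and `block_jacobi_step` are invented for the example, each diagonal block is re-solved by Gaussian elimination at every sweep (rather than factorized once, as recommended above), and the $4 \times 4$ test matrix with two blocks of size 2 is hypothetical.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y

def block_jacobi_step(A, b, x, sizes):
    """For every block i, solve A_ii x_new^i = b^i - sum_{j != i} A_ij x^j."""
    starts = [sum(sizes[:i]) for i in range(len(sizes))]
    x_new = list(x)
    for s, m in zip(starts, sizes):
        idx = range(s, s + m)
        # Right-hand side: remove the contribution of all off-diagonal blocks.
        rhs = [b[i] - sum(A[i][j] * x[j] for j in range(len(x))
                          if not s <= j < s + m) for i in idx]
        A_ii = [[A[i][j] for j in idx] for i in idx]
        x_new[s:s + m] = solve_dense(A_ii, rhs)
    return x_new

# Hypothetical example: 4 x 4 tridiagonal matrix split into two 2 x 2 blocks.
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
x_exact = [1.0, 2.0, 3.0, 4.0]
b = [sum(a * t for a, t in zip(row, x_exact)) for row in A]
x = [0.0] * 4
for _ in range(100):
    x = block_jacobi_step(A, b, x, sizes=[2, 2])
print(x)
```

All blocks are updated from the previous iterate `x`, which is what distinguishes the block-Jacobi sweep from a block-Gauss-Seidel sweep.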
8.7 Exercises


8.1. Let A=[1 2 2 1;-1 2 1 0;0 1 -2 2;1 2 1 2]. Check that this matrix
is nonsingular. In order to solve the system $Ax = (1,1,1,1)^t$, we decompose
$A = M - N$ (with $M$ nonsingular) and then build a sequence of vectors by
(8.1).


1. Let M=diag(diag(A)); N=M-A. What is the method thus defined? Compute
   $x_{100}, x_{200}, \dots$ Does the sequence $x_k$ converge?

2. Same questions for M=tril(A).

3. Is the conclusion the same for M=2*tril(A)?


8.2. Write a function JacobiCvg(A) returning the value 1 if the Jacobi
method applied to the matrix A is well defined and converges, and returning
0 if it diverges. You are not asked to program the method itself. Same question
for a function GaussSeidelCvg(A) informing about the convergence of the
Gauss-Seidel method. For each of the matrices


12.3 A 94-4 1 
4567 59 2 0 
Aim Lap o> 22> 22 1 04° 
0234 20 0 2 


do the Jacobi and Gauss-Seidel methods converge? Comment. 


8.3. For different values of n, define the matrix A=DiagDomMat(n,n) (see
Exercise 2.23). Do the Jacobi and Gauss-Seidel methods converge when applied
to this matrix? Justify your answers.


8.4. Let $A$, $M_1$, and $M_2$ be the matrices defined by


Set $N_i = M_i - A$ and $b = A(8,4,9,3)^t$. Compute the first 20 terms of the
sequences defined by $x_0 = (0,0,0,0)^t$ and $M_i x_{k+1} = N_i x_k + b$. For each
sequence, compare $x_{20}$ with the exact solution. Explain.


8.5. Program a function [x,iter]=Jacobi(A,b,tol,iterMax,x0). The input
arguments are:


• a matrix A and a right-hand side b;
• the initial vector x0;
• tol, the tolerance $\varepsilon$ for the convergence criterion;
• iterMax, the maximum number of iterations.


The output arguments are:


• the approximate solution x;
• iter, the number of performed iterations.


Hint: use the command nargin to determine the number of input arguments.


• If it is equal to 4, set x0=zeros(size(b));
• If it is equal to 3, set x0=zeros(size(b)); iterMax=200;
• If it is equal to 2, set x0=zeros(size(b)); iterMax=200; tol=1.e-4.


For $n = 20$, define A=Laplacian1dD(n); xx=(1:n)'/(n+1); b=xx.*sin(xx),
and sol=A\b. For different values of the parameter tol $= 10^{-s}$, $s = 2, 3, \dots$,
compute the approximate solution x=Jacobi(A,b,tol,1000) and compare
norm(x-sol) and norm(inv(A))*tol. Comment.


8.6 (∗). Write a function [x,iter]=Relax(A,b,w,tol,iterMax,x0)
programming the relaxation method with parameter w equal to $\omega$. For the same
matrix A as in Exercise 8.5 and the same vector b, plot the curve that gives 
the number of iterations carried out by the relaxation method in terms of 
w = i/10 (i = 1,2,...,20). Take iterMAx = 1000, tol = 10-®, and x0 = 0. 
Find the value of w that yields the solution with a minimal number of itera- 
tions. Compare with the theoretical value given in Theorem 8.3.2. 


8.7. Let $A \in \mathcal{M}_n(\mathbb{R})$ be a nonsingular square matrix for which the Jacobi
method is well defined. To solve the system $Ax = b$, we consider the following
iterative method, known as the relaxed Jacobi method ($\omega$ is a nonzero real
parameter):
\[
\frac{D}{\omega} x_{k+1} = \Big(\frac{1-\omega}{\omega} D + E + F\Big) x_k + b, \quad k \ge 0, \tag{8.12}
\]
where $x_0 \in \mathbb{R}^n$ is a given initial guess.


1. Program this algorithm (function RelaxJacobi). As in Exercise 8.6, find
   the optimal value of $\omega$ for which the solution is obtained with a minimal
   number of iterations. Take $n = 10$, iterMax $= 1000$, tol $= 10^{-4}$, x0 $= 0$,
   and vary $\omega$ between 0.1 and 2 with a step size equal to 0.1. Compute
   the norms as well as the residuals of the solutions obtained for a value of
   $\omega < 1$ and a value $\omega > 1$. Explain.

2. Theoretical study. We assume that $A$ is symmetric, positive definite, and
   tridiagonal, and we denote by $\mathcal{J}_\omega$ the iteration matrix associated with
   algorithm (8.12).

   (a) Find a relationship between the eigenvalues of the matrix $\mathcal{J}_\omega$ and
       those of $\mathcal{J}$.

   (b) Prove that the relaxed Jacobi method converges if and only if $\omega$
       belongs to an interval $I$ to be determined.

   (c) Find the value $\bar\omega$ ensuring the fastest convergence, i.e., such that
\[
\varrho(\mathcal{J}_{\bar\omega}) = \inf_{\omega \in I} \varrho(\mathcal{J}_\omega).
\]
       Compute $\varrho(\mathcal{J}_{\bar\omega})$, and conclude.

8.8. The goal of this exercise is to study a process of acceleration of
convergence for any iterative method. To solve the linear system $Ax = b$ of order
$n$, we have at our disposal an iterative method that reads $x_{k+1} = Bx_k + c$,
where $B$ is a matrix of order $n$ and $c \in \mathbb{R}^n$. We assume that this iterative
method converges, i.e., the sequence $x_k$ converges to the unique solution $x$.
The convergence acceleration of this iterative method amounts to building a
new sequence of vectors $(x'_j)_{j \ge 0}$ that converges faster to $x$ than the original
sequence $(x_k)_{k \ge 0}$. The sequence $(x'_j)_j$ is defined by
\[
x'_j = \sum_{k=0}^{j} \alpha_k^j x_k, \tag{8.13}
\]
where the coefficients $\alpha_k^j \in \mathbb{R}$ are chosen in order to ensure the fastest
convergence rate. We set $e_k = x_k - x$ and $e'_j = x'_j - x$. We shall use Matlab only
in the very last question of this exercise.


1. Explain why the condition (which shall be assumed in the sequel)
\[
\sum_{k=0}^{j} \alpha_k^j = 1 \tag{8.14}
\]
   is necessary.
2. Show that
\[
e'_j = p_j(B) e_0, \tag{8.15}
\]
   where $p_j \in \mathbb{P}_j$ is defined by $p_j(t) = \sum_{k=0}^{j} \alpha_k^j t^k$.
3. We assume that $B$ is a normal matrix and that $\|e_0\|_2 = 1$. Prove that
\[
\|e'_j\|_2 \le \|p_j(D)\|_2, \tag{8.16}
\]
   where $D$ is a diagonal matrix made up of the eigenvalues of $B$, denoted
   by $\sigma(B)$. Deduce that
\[
\|e'_j\|_2 \le \max_{\lambda \in \sigma(B)} |p_j(\lambda)|. \tag{8.17}
\]
4. Show that the eigenvalues of $B$ belong to an interval $[-\alpha, \alpha]$ with
   $0 < \alpha < 1$.

5. Clearly the fastest convergence rate of the acceleration process is obtained
   if the polynomial $p_j$ is chosen such that the right-hand side of (8.17) is
   minimal. However, since we do not know the eigenvalues of $B$, we replace
   the search for $p_j$ minimizing the right-hand side of (8.17) by the search
   for $p_j$ minimizing
\[
\max_{\lambda \in [-\alpha, \alpha]} |p_j(\lambda)|. \tag{8.18}
\]
   Observing that $p_j(1) = 1$, determine the solution to this problem using
   Proposition 9.5.3 on Chebyshev polynomials.
6. Use relation (9.12) to establish the following induction relation between
   three consecutive vectors of the sequence $(x'_j)_{j \ge 1}$:
\[
x'_{j+1} = \mu_j \big(Bx'_j + c - x'_{j-1}\big) + x'_{j-1}, \quad \forall j \ge 1, \tag{8.19}
\]
   where $\mu_j$ is a real number to be determined. This relation allows for
   computing $x'_j$ directly without having previously computed the vectors $x_k$,
   provided that the real sequence $(\mu_j)_j$ can be computed.

7. Compute $\mu_0$ and $\mu_1$. Express $\mu_{j+1}$ in terms of $\mu_j$. Prove that $\mu_j \in (1,2)$
   for all $j \ge 1$. Check that the sequence $(\mu_j)_j$ converges.

8. Programming the method. We consider the Laplacian in two space
   dimensions with a right-hand side $f(x,y) = \cos(x)\sin(y)$; see Exercise 6.7.
   Compare the Jacobi method and the accelerated Jacobi method for
   solving this problem. Take $n = 10$, $\alpha = 0.97$, and plot on the same graph the
   errors (assuming that the exact solution is A\b) of both methods in terms
   of the iteration number (limited to 50).


9 Conjugate Gradient Method


From a practical viewpoint, all iterative methods considered in the previous 
chapter have been supplanted by the conjugate gradient method, which is 
actually a direct method used as an iterative one. For simplicity, we will 
restrict ourselves, throughout this chapter, to real symmetric matrices. The 
case of complex self-adjoint matrices is hardly more difficult. However, that 
of non-self-adjoint matrices is relatively more delicate to handle. 


9.1 The Gradient Method 


Often called the “steepest descent method,” the gradient method is also known 
as Richardson’s method. It is a classical iterative method with a particular 
choice of regular decomposition. We recall its definition already presented in 
Example 8.1.1. 


Definition 9.1.1. The iterative method, known as the gradient method, is
defined by the following regular decomposition:
\[
M = \frac{1}{\alpha} I_n \quad \text{and} \quad N = \Big(\frac{1}{\alpha} I_n - A\Big),
\]
where $\alpha$ is a real nonzero parameter. In other words, the gradient method
consists in computing the sequence $x_k$ defined by
\[
x_0 \text{ given in } \mathbb{R}^n, \qquad x_{k+1} = x_k + \alpha(b - Ax_k), \quad \forall k \ge 0.
\]
For the implementation of this method, see Algorithm 9.1. 
Theorem 9.1.1. Let $A$ be a matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.


(i) If $\lambda_1 \le 0 \le \lambda_n$, then the gradient method does not converge for any value
    of $\alpha$.

(ii) If $0 < \lambda_1 \le \cdots \le \lambda_n$, then the gradient method converges if and only if
    $0 < \alpha < 2/\lambda_n$. In this case, the optimal parameter $\alpha$, which minimizes
    $\varrho(M^{-1}N)$, is
\[
\alpha_{opt} = \frac{2}{\lambda_1 + \lambda_n}
\quad \text{and} \quad
\min_\alpha \varrho(M^{-1}N) = \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} = \frac{\mathrm{cond}_2(A) - 1}{\mathrm{cond}_2(A) + 1} .
\]


Data: A, b. Output: x (approximation of $x$)
Initialization:
    choose $\alpha$.
    choose $x \in \mathbb{R}^n$.
    compute $r = b - Ax$.
While $\|r\|_2 > \varepsilon \|b\|_2$
    $x = x + \alpha r$
    $r = b - Ax$
End While
Algorithm 9.1: Gradient algorithm.
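
Algorithm 9.1 can be sketched in Python as follows. The function name `gradient_fixed_step`, the iteration guard `k_max`, the choice $x_0 = 0$, and the $2 \times 2$ symmetric test matrix (with eigenvalues 2 and 8) are assumptions made for this example.

```python
import math

def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

def gradient_fixed_step(A, b, alpha, eps=1e-10, k_max=10_000):
    """Gradient method with constant step alpha, starting from x_0 = 0."""
    x = [0.0] * len(b)
    r = list(b)                                   # r = b - A x_0 = b
    stop = eps * norm2(b)
    k = 0
    while norm2(r) > stop and k < k_max:
        x = [x_i + alpha * r_i for x_i, r_i in zip(x, r)]      # x = x + alpha r
        r = [b_i - t for b_i, t in zip(b, matvec(A, x))]       # r = b - A x
        k += 1
    return x, k

A = [[5.0, -3.0], [-3.0, 5.0]]     # hypothetical matrix, eigenvalues 2 and 8
b = [-1.0, 7.0]                    # chosen so that the solution is (1, 2)
x_opt, k_opt = gradient_fixed_step(A, b, alpha=2.0 / (2.0 + 8.0))
x_small, k_small = gradient_fixed_step(A, b, alpha=0.05)
print(k_opt, k_small)
```

Both step sizes lie in the convergence interval $]0, 2/\lambda_n[ = ]0, 0.25[$, but the value $2/(\lambda_1 + \lambda_n) = 0.2$ needs markedly fewer iterations than the smaller step.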


Remark 9.1.1. Note that if the matrix $A$ is diagonalizable with eigenvalues
$\lambda_1 \le \cdots \le \lambda_n < 0$, then a symmetric result of (ii) occurs by changing $\alpha$ into
$-\alpha$. On the other hand, the conditioning of a normal invertible matrix $A$ with
positive eigenvalues is $\mathrm{cond}_2(A) = \lambda_n/\lambda_1$. Thus, for the optimal parameter
$\alpha = \alpha_{opt}$, the spectral radius of the iteration matrix,
$\varrho(M^{-1}N) = \frac{\mathrm{cond}_2(A) - 1}{\mathrm{cond}_2(A) + 1}$, is an increasing
function of $\mathrm{cond}_2(A)$. In other words, the better the matrix $A$ is conditioned, the
faster the gradient method converges.


Proof of Theorem 9.1.1. According to Theorem 8.1.1, we know that the
gradient method converges if and only if $\varrho(M^{-1}N) < 1$. Here, $M^{-1}N =
(I_n - \alpha A)$; hence
\[
\varrho(M^{-1}N) < 1 \;\Leftrightarrow\; |1 - \alpha\lambda_i| < 1 \;\Leftrightarrow\; -1 < 1 - \alpha\lambda_i < 1, \quad \forall i .
\]
This implies that $\alpha\lambda_i > 0$ for all $1 \le i \le n$. As a result, all eigenvalues of
$A$ have to be nonzero and bear the same sign as $\alpha$. Therefore, the gradient
method does not converge if two eigenvalues have opposite signs, whatever
the value of $\alpha$. If, on the other hand, we have $0 < \lambda_1 \le \cdots \le \lambda_n$, then we
deduce
\[
-1 < 1 - \alpha\lambda_n \;\Rightarrow\; \alpha < \frac{2}{\lambda_n} .
\]
To compute the optimal parameter $\alpha_{opt}$, note that the function $\lambda \mapsto |1 - \alpha\lambda|$
is decreasing on $]-\infty, 1/\alpha]$ and then increasing on $[1/\alpha, +\infty[$; see Figure 9.1.


Fig. 9.1. Eigenvalues of the matrix $B_\alpha = I - \alpha A$.


Thus
\[
\varrho(M^{-1}N) = \max\{|1 - \alpha\lambda_1|, |1 - \alpha\lambda_n|\}.
\]
The piecewise affine function $\alpha \in [0, 2/\lambda_n] \mapsto \varrho(M^{-1}N)$ attains its minimum
at the point $\alpha_{opt}$ defined by $1 - \alpha_{opt}\lambda_1 = \alpha_{opt}\lambda_n - 1$, i.e., $\alpha_{opt} = \frac{2}{\lambda_1 + \lambda_n}$. At
this point we check that $\varrho(M^{-1}N) = \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}$.
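
The minimization carried out in this proof can be checked numerically. The sketch below evaluates $\max\{|1 - \alpha\lambda_1|, |1 - \alpha\lambda_n|\}$ on a grid of step sizes and compares it with the value at $\alpha_{opt}$; the function names and the extreme eigenvalues 2 and 8 are illustrative assumptions.

```python
def gradient_rate(alpha, lam_min, lam_max):
    """max(|1 - alpha*lam_min|, |1 - alpha*lam_max|): the spectral radius of
    I - alpha*A when lam_min and lam_max are the extreme eigenvalues of A."""
    return max(abs(1.0 - alpha * lam_min), abs(1.0 - alpha * lam_max))

def optimal_step(lam_min, lam_max):
    alpha_opt = 2.0 / (lam_min + lam_max)
    return alpha_opt, gradient_rate(alpha_opt, lam_min, lam_max)

lam_min, lam_max = 2.0, 8.0                  # hypothetical extreme eigenvalues
alpha_opt, best_rate = optimal_step(lam_min, lam_max)
cond = lam_max / lam_min
# Sample the admissible interval ]0, 2/lam_max[ and record the rate at each point.
grid = [i * (2.0 / lam_max) / 200 for i in range(1, 200)]
rates = [gradient_rate(alpha, lam_min, lam_max) for alpha in grid]
print(alpha_opt, best_rate, min(rates))
```

No grid point does better than $\alpha_{opt}$, and the optimal rate coincides with $(\mathrm{cond}_2(A) - 1)/(\mathrm{cond}_2(A) + 1)$ up to rounding.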


9.2 Geometric Interpretation 


This section provides an explanation of the name of the gradient method. To 
this end, we introduce several technical tools. 


Definition 9.2.1. Let $f$ be a function from $\mathbb{R}^n$ into $\mathbb{R}$. We call the vector of
partial derivatives at the point $x$ the gradient (or differential) of the function
$f$ at $x$, which we denote by
\[
\nabla f(x) = \Big(\frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x)\Big)^t .
\]
We recall that the partial derivative $\frac{\partial f}{\partial x_i}(x)$ is computed by differentiating
$f(x)$ with respect to $x_i$ while keeping the other entries $x_j$, $j \ne i$, constant.
We consider the problem of minimizing quadratic functions from $\mathbb{R}^n$ into
$\mathbb{R}$ defined by
\[
f(x) = \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle = \frac{1}{2}\sum_{i,j=1}^{n} a_{ij} x_i x_j - \sum_{i=1}^{n} b_i x_i, \tag{9.1}
\]
where $A$ is a symmetric matrix in $\mathcal{M}_n(\mathbb{R})$ and $b$ is a vector in $\mathbb{R}^n$. The function
$f(x)$ is said to have a minimum (or attains its minimum) at $x_0$ if $f(x) \ge f(x_0)$
for all $x \in \mathbb{R}^n$. Figure 9.2 shows the surfaces $x \in \mathbb{R}^2 \mapsto \langle A_i x, x\rangle$ for each of
the symmetric matrices
\[
A_1 = \begin{pmatrix} 5 & -3 \\ -3 & 5 \end{pmatrix}, \qquad
A_2 = \begin{pmatrix} 1 & -3 \\ -3 & 1 \end{pmatrix}. \tag{9.2}
\]
The matrix $A_1$ has eigenvalues 2 and 8; it is therefore positive definite. The
matrix $A_2$ has eigenvalues $-2$ and 4, so it is not positive (or negative!).


Fig. 9.2. Surfaces $x \in \mathbb{R}^2 \mapsto \langle Ax, x\rangle$, for $A_1$ (left) and $A_2$ (right), see (9.2).


Proposition 9.2.1. The gradient of the function $f(x)$ defined by (9.1) is
\[
\nabla f(x) = Ax - b .
\]
Moreover,


(i) if $A$ is positive definite, then $f$ admits a unique minimum at $x_0$ that is a
    solution of the linear system $Ax = b$;

(ii) if $A$ is positive indefinite (that is, nonnegative with at least one zero
    eigenvalue) and if $b$ belongs to the range of $A$, then $f$ attains
    its minimum at all vectors $x_0$ that solve the linear system $Ax = b$ and at
    these vectors only;

(iii) in all other cases, that is, if $A$ is not positive or if $b$ does not belong to
    the range of $A$, $f$ does not have a minimum, i.e., its infimum is $-\infty$.


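Proof. We compute the $k$th partial derivative of $f$,
\[
\frac{\partial f}{\partial x_k}(x) = a_{kk} x_k + \frac{1}{2}\sum_{i \ne k} a_{ik} x_i + \frac{1}{2}\sum_{i \ne k} a_{ki} x_i - b_k = \sum_{i} a_{ik} x_i - b_k = (Ax - b)_k,
\]
thereby easily deducing that $\nabla f(x) = Ax - b$. Since $A$ is real symmetric,
we can diagonalize it. Let $(\lambda_i)$ and $(\hat e_i)$ be its eigenvalues and orthonormal
eigenvectors. Setting $x = \sum_i \hat x_i \hat e_i$ and $b = \sum_i \hat b_i \hat e_i$, we have
\[
f(x) = \frac{1}{2}\sum_{i=1}^{n} \lambda_i \hat x_i^2 - \sum_{i=1}^{n} \hat b_i \hat x_i ,
\]
which, when all the $\lambda_i$ are nonzero, can be rewritten as
\[
f(x) = \frac{1}{2}\sum_{i=1}^{n} \lambda_i \Big(\hat x_i - \frac{\hat b_i}{\lambda_i}\Big)^2 - \frac{1}{2}\sum_{i=1}^{n} \frac{\hat b_i^2}{\lambda_i} .
\]
If $A$ is positive definite, then $\lambda_i > 0$ for all $1 \le i \le n$. In this case, we minimize
$f(x)$ by minimizing each of the squared terms. There exists a unique minimum
$x$ whose entries are $\hat x_i = \hat b_i/\lambda_i$, and this vector is the unique solution to the
system $Ax = b$.

If $A$ is only positive indefinite, then there exists at least one index $i_0$ such
that $\lambda_{i_0} = 0$. In this case, if $\hat b_{i_0} \ne 0$, taking $x = t\,\hat b_{i_0} \hat e_{i_0}$, we obtain
\[
\lim_{t \to +\infty} f(x) = -\infty,
\]
so $f$ is not bounded from below. If $\hat b_i = 0$ for all indices $i$ such that $\lambda_i = 0$,
then $b \in \mathrm{Im}(A)$. Let $\tilde A$ be the restriction of $A$ to $\mathrm{Im}(A)$. Hence the minimum
of $f$ is attained at all points $x = \tilde A^{-1} b + y$, where $y$ spans $\mathrm{Ker}(A)$.

Finally, if $A$ admits an eigenvalue $\lambda_{i_0} < 0$, then, taking $x$ parallel to $\hat e_{i_0}$,
we easily see that $f$ is not bounded from below.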


Corollary 9.2.1. Let $A$ be a real positive definite symmetric matrix. Let $f(x)$
be the function defined by (9.1). Let $F$ be a vector subspace of $\mathbb{R}^n$. There exists
a unique vector $x_0 \in F$ such that
\[
f(x_0) \le f(x), \quad \forall x \in F .
\]
Furthermore, $x_0$ is the unique vector of $F$ such that
\[
\langle Ax_0 - b, y\rangle = 0, \quad \forall y \in F .
\]


Proof. Denote by $P$ the orthogonal projection on the vector subspace $F$ in
$\mathbb{R}^n$. We also denote by $P$ its matrix representation in the canonical basis. By
definition of the orthogonal projection we have $P^* = P$, $P^2 = P$, and the
mapping $P$ is onto from $\mathbb{R}^n$ into $F$; consequently,
\[
\min_{x \in F} f(x) = \min_{y \in \mathbb{R}^n} f(Py).
\]
We compute $f(Py) = \frac{1}{2}\langle P^*APy, y\rangle - \langle P^*b, y\rangle$. Proposition 9.2.1 can be
applied to this function. Since $A$ is positive definite, we easily see that $P^*AP$ is
nonnegative. If $P^*b$ did not belong to the range of $P^*AP$, then we would
infer that the infimum of $f(Py)$ is $-\infty$, which is impossible, since
\[
\inf_{y \in \mathbb{R}^n} f(Py) = \inf_{x \in F} f(x) \ge \min_{x \in \mathbb{R}^n} f(x) > -\infty,
\]
because $A$ is positive definite. Thus $P^*b \in \mathrm{Im}(P^*AP)$ and the minimum of
$f(Py)$ is attained by all solutions of $P^*APx = P^*b$. Let us prove that this
equation has a unique solution in $F$. Let $x_1$ and $x_2$ be two solutions in $F$ of
$P^*APx = P^*b$. We have
\[
P^*AP(x_1 - x_2) = 0 \;\Rightarrow\; \langle A(Px_1 - Px_2), (Px_1 - Px_2)\rangle = 0 \;\Rightarrow\; Px_1 = Px_2 .
\]
Since $x_1$ and $x_2$ belong to $F$, we have $x_1 = Px_1 = Px_2 = x_2$. Multiplying
the equation $P^*APx_0 = P^*b$ by $y \in F$ and using the fact that $Px_0 = x_0$ and
$Py = y$, because $x_0, y \in F$, yields the relation $\langle Ax_0 - b, y\rangle = 0$ for all $y$ in $F$.


Theorem 9.2.1. Let $A$ be a positive definite symmetric matrix, and let $f$ be
the function defined on $\mathbb{R}^n$ by (9.1).

(i) $x \in \mathbb{R}^n$ is the minimum of $f$ if and only if $\nabla f(x) = 0$.

(ii) Let $x \in \mathbb{R}^n$ be such that $\nabla f(x) \ne 0$. Then $\forall \alpha \in (0, 2/\varrho(A))$, we have
\[
f(x - \alpha\nabla f(x)) < f(x).
\]


Remark 9.2.1. From this theorem we infer an iterative method for minimizing
$f$. We construct a sequence of points $(x_k)_k$ such that the sequence $(f(x_k))_k$
is decreasing:
\[
x_{k+1} = x_k - \alpha\nabla f(x_k) = x_k + \alpha(b - Ax_k).
\]


This is exactly the gradient method that we have studied in the previous sec- 
tion. In other words, we have shown that solving a linear system whose ma- 
trix is symmetric and positive definite is equivalent to minimizing a quadratic 
function. This is a very important idea that we shall use in the sequel of this 
chapter. 


Proof of Theorem 9.2.1. We already know that $f$ attains its minimum at a
unique point $x$ that is a solution of $Ax = b$. At this point $x$, we therefore have
$\nabla f(x) = 0$. Next, let $x$ be a point such that $\nabla f(x) \ne 0$ and set $\delta = -\alpha(Ax - b)$.
Since $A$ is symmetric, we compute
\[
f(x + \delta) = f(x) + \frac{1}{2}\langle A\delta, \delta\rangle + \langle Ax - b, \delta\rangle .
\]
Now, $\langle A\delta, \delta\rangle \le \|A\|_2 \|\delta\|^2$, and for a symmetric real matrix, we have
$\|A\|_2 = \varrho(A)$. Thus
\[
f(x + \delta) \le f(x) + \big(\alpha^2 \varrho(A)/2 - \alpha\big) \|Ax - b\|^2 .
\]
As a consequence, $f(x + \delta) < f(x)$ if $0 < \alpha < 2/\varrho(A)$.
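
The descent property can be observed on a small example. In the Python sketch below, the matrix (eigenvalues 2 and 8, so $2/\varrho(A) = 0.25$), the right-hand side, and the starting point are hypothetical choices; the starting point is picked so that the gradient is an eigenvector for the largest eigenvalue, which is the least favorable direction for the bound.

```python
def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

def f(A, b, x):
    """Quadratic function (9.1): 1/2 <Ax, x> - <b, x>."""
    return 0.5 * dot(matvec(A, x), x) - dot(b, x)

def gradient(A, b, x):
    return [t - b_i for t, b_i in zip(matvec(A, x), b)]

def gradient_step(A, b, x, alpha):
    g = gradient(A, b, x)
    return [x_i - alpha * g_i for x_i, g_i in zip(x, g)]

A = [[5.0, -3.0], [-3.0, 5.0]]   # hypothetical matrix, rho(A) = 8
b = [-1.0, 7.0]                  # minimizer of f is (1, 2)
x = [2.0, 1.0]                   # gradient at x is (8, -8)
decreases = {alpha: f(A, b, gradient_step(A, b, x, alpha)) < f(A, b, x)
             for alpha in (0.05, 0.125, 0.2, 0.24, 0.3)}
print(decreases)
```

For this starting point the value of $f$ decreases for every tested step below 0.25 and increases for the step 0.3, which lies outside the interval of Theorem 9.2.1.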


9.3 Some Ideas for Further Generalizations 


We can improve the gradient method with constant step size, which we have
just discussed, by choosing at each iterative step a different coefficient $\alpha_k$ that
minimizes $f(x_k - \alpha\nabla f(x_k))$.


Definition 9.3.1. The following iterative method for solving the linear system
$Ax = b$ is called the gradient method with variable step size:
\[
x_0 \text{ given in } \mathbb{R}^n, \qquad x_{k+1} = x_k + \alpha_k(b - Ax_k), \quad \forall k \ge 0,
\]
where $\alpha_k$ is chosen as the minimizer of the function
\[
g(\alpha) = f(x_k - \alpha\nabla f(x_k)),
\]
with $f(x)$ defined by (9.1).


The optimal value $\alpha_k$ is given by the following lemma, the proof of which is
left to the reader as an easy exercise.


Lemma 9.3.1. Let $A$ be a positive definite symmetric matrix. For the gradient
method with variable step size, there exists a unique optimal step size given by
\[
\alpha_k = \frac{\|Ax_k - b\|^2}{\langle A(Ax_k - b), (Ax_k - b)\rangle} .
\]
We observe that $\alpha_k$ is always well defined except when $Ax_k - b = 0$, in which
case the method has already converged! The gradient method with optimal
step size is implemented in Algorithm 9.2.


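Data: A, b. Output: x (approximation of $x$)
Initialization:
    choose $x \in \mathbb{R}^n$.
    compute $r = b - Ax$.
    $\alpha = \frac{\|r\|_2^2}{\langle Ar, r\rangle}$
While $\|r\|_2 > \varepsilon \|b\|_2$
    $x = x + \alpha r$
    $r = b - Ax$
    $\alpha = \frac{\|r\|_2^2}{\langle Ar, r\rangle}$    (optimal step size)
End While
Algorithm 9.2: Gradient algorithm with variable step size.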
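
A possible Python transcription of Algorithm 9.2 is given below. It is a sketch under stated assumptions: the name `gradient_optimal_step`, the guard `k_max`, the choice $x_0 = 0$, and the test data are not from the text, and the step size is computed at the top of the loop (rather than at the end of the previous pass), which produces the same iterates while avoiding a division by zero once the residual vanishes.

```python
import math

def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

def gradient_optimal_step(A, b, eps=1e-10, k_max=10_000):
    """Gradient method with the step alpha_k = ||r||^2 / <Ar, r> of Lemma 9.3.1."""
    x = [0.0] * len(b)
    r = list(b)                                   # r = b - A x_0 = b
    stop = eps * math.sqrt(dot(b, b))
    k = 0
    while math.sqrt(dot(r, r)) > stop and k < k_max:
        alpha = dot(r, r) / dot(matvec(A, r), r)  # optimal step size
        x = [x_i + alpha * r_i for x_i, r_i in zip(x, r)]
        r = [b_i - t for b_i, t in zip(b, matvec(A, x))]
        k += 1
    return x, k

A = [[5.0, -3.0], [-3.0, 5.0]]   # hypothetical symmetric positive definite matrix
b = [-1.0, 7.0]                  # chosen so that the solution is (1, 2)
x_var, k_var = gradient_optimal_step(A, b)
print(x_var, k_var)
```

Unlike the constant-step version, no knowledge of the eigenvalues of $A$ is needed to choose the step.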


The gradient method with variable step size is a bit more complex than
the one with constant step size (more operations are needed to compute $\alpha_k$),
and in practice, it is not much more efficient than the latter. In order to
improve the gradient method and construct the conjugate gradient method,
we introduce the important notion of Krylov space.


Definition 9.3.2. Let $r$ be a vector in $\mathbb{R}^n$. We call the vector subspace of $\mathbb{R}^n$
spanned by the $k+1$ vectors $\{r, Ar, \dots, A^k r\}$ the Krylov space associated
with the vector $r$ (and matrix $A$), denoted by $K_k(A, r)$ or simply $K_k$.




The Krylov spaces $(K_k)_{k \ge 0}$ form by inclusion an increasing sequence of
vector subspaces. Since $K_k \subset \mathbb{R}^n$, this sequence becomes stationary from a
certain $k$. Namely, we prove the following result.


Lemma 9.3.2. The sequence of Krylov spaces $(K_k)_{k \ge 0}$ is increasing,
\[
K_k \subset K_{k+1} \quad \forall k \ge 0.
\]
Moreover, for all vectors $r_0 \ne 0$, there exists $k_0 \in \{0, 1, \dots, n-1\}$ such that
\[
\dim K_k = \begin{cases} k + 1 & \text{if } 0 \le k \le k_0, \\ k_0 + 1 & \text{if } k_0 \le k. \end{cases}
\]
This integer $k_0$ is called the Krylov critical dimension.


Proof. It is clear that $\dim K_k \le k + 1$ and $\dim K_0 = 1$. Since $\dim K_k \le n$
for any $k$, there exists $k_0$ that is the greatest integer such that $\dim K_k = k + 1$
for any $k \le k_0$. By definition of $k_0$, the dimension of $K_{k_0+1}$ is strictly
smaller than $k_0 + 2$. However, since $K_{k_0} \subset K_{k_0+1}$, we necessarily have
$\dim K_{k_0+1} = k_0 + 1$, and the vector $A^{k_0+1} r_0$ is a linear combination of the
vectors $(r_0, Ar_0, \dots, A^{k_0} r_0)$. Thereby we infer by a simple induction argument
that all vectors $A^k r_0$ for $k \ge k_0 + 1$ are also linear combinations of the vectors
$(r_0, Ar_0, \dots, A^{k_0} r_0)$. Accordingly, $K_k = K_{k_0}$ for all $k \ge k_0$.
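
To make the Krylov critical dimension concrete, the following sketch computes $k_0$ for small integer matrices by building the vectors $r_0, Ar_0, A^2 r_0, \dots$ and stopping as soon as they become linearly dependent. It is an illustration under stated assumptions: exact rational arithmetic (`fractions.Fraction`) is used so that the rank test is reliable, and the helper names `rank` and `krylov_critical_dimension` are hypothetical.

```python
from fractions import Fraction

def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rank(vectors):
    """Rank of a family of vectors, by exact Gaussian elimination."""
    rows = [list(v) for v in vectors]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * p for a, p in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def krylov_critical_dimension(A, r0):
    """Greatest k0 such that dim K_k = k + 1 for all k <= k0 (r0 nonzero)."""
    vectors = [[Fraction(v) for v in r0]]
    while True:
        candidate = vectors + [matvec(A, vectors[-1])]
        if rank(candidate) < len(candidate):   # A^(k+1) r0 is dependent
            return len(vectors) - 1
        vectors = candidate

laplacian = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
diagonal = [[1, 0, 0], [0, 2, 0], [0, 0, 2]]
k0_full = krylov_critical_dimension(laplacian, [1, 0, 0])    # n - 1
k0_two = krylov_critical_dimension(diagonal, [1, 1, 1])      # two eigenvalues
k0_eigen = krylov_critical_dimension(diagonal, [1, 0, 0])    # r0 eigenvector
```

In the second example $r_0$ only involves two distinct eigenvalues of $A$, so the Krylov spaces stop growing at dimension 2; in the third, $r_0$ is an eigenvector and $K_k = K_0$ for all $k$.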


Proposition 9.3.1. We consider the gradient method (with constant or variable
step size)
$$x_0 \in \mathbb{R}^n \text{ initial choice}, \qquad x_{k+1} = x_k + \alpha_k (b - Ax_k).$$

The vector $r_k = b - Ax_k$, called the residual, satisfies the following properties:

1. $r_k$ belongs to the Krylov space $K_k$ corresponding to the initial residual $r_0$.
2. $x_{k+1}$ belongs to the affine space $[x_0 + K_k]$ defined as the collection of
vectors $x$ such that $x - x_0$ belongs to the vector subspace $K_k$.


Proof. By definition we have $x_{k+1} = x_k + \alpha_k r_k$, which, by multiplication by
$A$ and subtraction from $b$, yields
$$r_{k+1} = r_k - \alpha_k A r_k. \tag{9.3}$$
By induction on (9.3), we easily infer that $r_k \in \mathrm{span}\{r_0, Ar_0, \dots, A^k r_0\}$.
Then, a similar induction on $x_{k+1} = x_k + \alpha_k r_k$ shows that $x_{k+1} \in [x_0 + K_k]$.


Lemma 9.3.3. Let $(x_k)_{k \ge 0}$ be a sequence in $\mathbb{R}^n$. Let $K_k$ be the Krylov space
relative to the vector $r_0 = b - Ax_0$. If $x_{k+1} \in [x_0 + K_k]$, then
$r_{k+1} = (b - Ax_{k+1}) \in K_{k+1}$.

Proof. If $x_{k+1} \in [x_0 + K_k]$, there exist coefficients $(\alpha_i)_{0 \le i \le k}$ such that
$$x_{k+1} = x_0 + \sum_{i=0}^{k} \alpha_i A^i r_0.$$
We multiply this equation by $A$ and subtract it from $b$, which yields
$$r_{k+1} = r_0 - \sum_{i=0}^{k} \alpha_i A^{i+1} r_0.$$
Hence $r_{k+1} \in K_{k+1}$.


9.4 Theoretical Definition of the Conjugate Gradient 
Method 


We now assume that all matrices considered here are symmetric and positive
definite. To improve the gradient method, we forget, from now on, the induction
relation that gives $x_{k+1}$ in terms of $x_k$ and we keep as the starting point
only the relation provided by Proposition 9.3.1, namely
$$x_{k+1} \in [x_0 + K_k],$$
where $K_k$ is the Krylov space relative to the initial residual $r_0 = b - Ax_0$ ($x_0$
is some initial choice). Of course, there exists an infinity of possible choices
for $x_{k+1}$ in the affine space $[x_0 + K_k]$. To determine $x_{k+1}$ in a unique fashion,
we put forward two simple criteria:

• 1st definition (orthogonalization principle). We choose $x_{k+1} \in [x_0 + K_k]$
such that $r_{k+1} \perp K_k$.

• 2nd definition (minimization principle). We choose $x_{k+1} \in [x_0 + K_k]$
that minimizes in $[x_0 + K_k]$
$$f(x) = \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle.$$


Theorem 9.4.1. Let $A$ be a symmetric positive definite matrix. For the two
above definitions, there exists indeed a unique vector $x_{k+1} \in [x_0 + K_k]$. Both
definitions correspond to the same algorithm in the sense that they lead to the
same value of $x_{k+1}$. Furthermore, this algorithm converges to the solution of
the linear system $Ax = b$ in at most $n$ iterations. We call this method the
“conjugate gradient method.”


Remark 9.4.1. The previous theorem shows that the conjugate gradient algorithm
that we have devised as an iterative method is in fact a direct method,
since it converges in a finite number of iterations (precisely $k_0 + 1$, where $k_0$ is
the critical Krylov dimension defined in Lemma 9.3.2). However, in practice,
we use it like an iterative method that (hopefully) converges numerically in
fewer than $k_0 + 1 \le n$ iterations.

Intuitively, it is easy to see why the conjugate gradient improves the simple
gradient method. Actually, instead of merely decreasing $f(x)$ at each iteration
(cf. Theorem 9.2.1), we minimize $f(x)$ on an increasing sequence of affine
subspaces.


Remark 9.4.2. Introducing $r = b - Ax$, we have
$$h(r) = \langle A^{-1} r, r\rangle / 2 = f(x) + \langle A^{-1} b, b\rangle / 2.$$
Thanks to Lemma 9.3.3, the second definition is equivalent to finding $x_{k+1} \in
[x_0 + K_k]$ such that its residual $r_{k+1} = b - Ax_{k+1}$ minimizes the function $h(r)$
in $K_{k+1}$.


Proof. First, let us prove that the two suggested definitions are identical and
uniquely define $x_{k+1}$. For any $y \in K_k$, we set
$$g(y) = f(x_0 + y) = \frac{1}{2}\langle Ay, y\rangle - \langle r_0, y\rangle + f(x_0).$$
Minimizing $f$ on $[x_0 + K_k]$ is equivalent to minimizing $g$ on $K_k$. Now, by
Corollary 9.2.1, $g(y)$ admits a unique minimum in $K_k$, which we denote by
$(x_{k+1} - x_0)$. As a consequence, the second definition gives a unique $x_{k+1}$.
Furthermore, Corollary 9.2.1 also states that
$$\langle Ax_{k+1} - b, y\rangle = 0 \quad \forall y \in K_k,$$
which is nothing else but the definition of $r_{k+1} \perp K_k$. The two algorithms are
therefore identical and yield the same unique value of $x_{k+1}$.

If the critical dimension of the Krylov space $k_0$ is equal to $n - 1$, then
$\dim K_{k_0} = n$, and the affine subspace $[x_0 + K_{k_0}]$ coincides with $\mathbb{R}^n$ entirely.
Accordingly, $x_{k_0+1} = x_n$ is the minimum of $f$ on $\mathbb{R}^n$ that satisfies, by Proposition
9.2.1, $Ax_n = b$. The conjugate gradient method has thereby converged in $n$ iterations.

If $k_0 < n - 1$, then by virtue of Lemma 9.3.2, for all $k \ge k_0$, $\dim K_k =
k_0 + 1$. In particular, $A^{k_0+1} r_0 \in K_{k_0}$, which means that $A^{k_0+1} r_0$ is a linear
combination of vectors $(r_0, Ar_0, \dots, A^{k_0} r_0)$,
$$A^{k_0+1} r_0 = \sum_{i=0}^{k_0} \alpha_i A^i r_0.$$
The coefficient $\alpha_0$ is inevitably nonzero. As a matter of fact, if this were
not true, we could multiply the above equation by $A^{-1}$ (we recall that $A$
is nonsingular, since it is positive definite) and show that $A^{k_0} r_0$ is a linear
combination of vectors $(r_0, Ar_0, \dots, A^{k_0-1} r_0)$, which would imply that $K_{k_0} =
K_{k_0-1}$, contradicting the definition of the critical dimension $k_0$. Since $r_0 =
b - Ax_0$, we get
$$A\left(\frac{1}{\alpha_0} A^{k_0} r_0 - \sum_{i=1}^{k_0} \frac{\alpha_i}{\alpha_0} A^{i-1} r_0 + x_0\right) = b.$$
It turns out from this equation that the solution of $Ax = b$ belongs to the
affine space $[x_0 + K_{k_0}]$. Hence, at the $(k_0 + 1)$th iteration, the minimum of
$f(x)$ on $[x_0 + K_{k_0}]$ happens to be the minimum on all of $\mathbb{R}^n$, and the iterate
$x_{k_0+1}$ is nothing but the exact solution. The conjugate gradient method has
thus converged in exactly $k_0 + 1 \le n$ iterations.


The definition of the conjugate gradient method that we have just given is
purely theoretical. Actually, no practical algorithm has been provided either
to minimize $f(x)$ on $[x_0 + K_k]$ or to construct $r_{k+1}$ orthogonal to $K_k$. It
remains to show how, in practice, we compute $x_{k+1}$. To do so, we will make
use of an additional property of the conjugate gradient method.


Proposition 9.4.1. Let $A$ be a symmetric, positive definite matrix. Let
$(x_k)_{0 \le k \le n}$ be the sequence of approximate solutions obtained by the conjugate
gradient method. Set
$$r_k = b - Ax_k \quad\text{and}\quad d_k = x_{k+1} - x_k.$$
Then
(i) the Krylov space $K_k$, defined by $K_k = \mathrm{span}\{r_0, Ar_0, \dots, A^k r_0\}$, satisfies
$$K_k = \mathrm{span}\{r_0, \dots, r_k\} = \mathrm{span}\{d_0, \dots, d_k\},$$
(ii) the sequence $(r_k)_{0 \le k \le n-1}$ is orthogonal, i.e.,
$$\langle r_k, r_l\rangle = 0 \quad\text{for all } 0 \le l < k \le n - 1,$$
(iii) the sequence $(d_k)_{0 \le k \le n-1}$ is “conjugate” with respect to $A$, or
“$A$-conjugate,” i.e.,
$$\langle Ad_k, d_l\rangle = 0 \quad\text{for all } 0 \le l < k \le n - 1.$$


Remark 9.4.3. A conjugate sequence with respect to a matrix $A$ is in fact
orthogonal for the scalar product defined by $\langle Ax, y\rangle$ (we recall that $A$ is
symmetric and positive definite). This property gave its name to the conjugate
gradient method.


Proof of Proposition 9.4.1. Let us first note that the result is independent of
the critical dimension $k_0$, defined by Lemma 9.3.2. When $k \ge k_0 + 1$, that is,
when the conjugate gradient method has already converged, we have $r_k = 0$
and $x_k = x_{k_0+1}$; thus $d_k = 0$. In this case, the sequence $r_k$ is indeed orthogonal
and $d_k$ is indeed conjugate for $k \ge k_0 + 1$. When $k \le k_0$, the first definition
of the conjugate gradient implies that $r_{k+1} \in K_{k+1}$ and $r_{k+1} \perp K_k$. Now,
$K_k \subset K_{k+1}$ and $\dim K_k = k + 1$ for $k \le k_0$. Therefore the family $(r_k)_{0 \le k \le k_0}$
is free and orthogonal. In particular, this entails that $K_k = \mathrm{span}\{r_0, \dots, r_k\}$
for all $k \ge 0$.

On the other hand, $d_k = x_{k+1} - x_k$, with $x_k \in [x_0 + K_{k-1}]$ and $x_{k+1} \in
[x_0 + K_k]$, implies that $d_k$ belongs to $K_k$ for all $k \ge 1$. As a consequence,
$$\mathrm{span}\{d_0, \dots, d_k\} \subset K_k.$$
Let us show that the family $(d_k)_{1 \le k \le k_0}$ is conjugate with respect to $A$. For
$l < k$, we have
$$\langle Ad_k, d_l\rangle = \langle Ax_{k+1} - Ax_k, d_l\rangle = \langle r_k - r_{k+1}, d_l\rangle = 0,$$
since $d_l \in K_l = \mathrm{span}\{r_0, \dots, r_l\}$ and the family $(r_k)$ is orthogonal. We deduce
that the family $(d_k)_{0 \le k \le k_0}$ is orthogonal for the scalar product $\langle Ax, y\rangle$.
Now, $d_k \ne 0$ for $k \le k_0$, because otherwise, we would have $x_{k+1} = x_k$
and accordingly $r_{k+1} = r_k \ne 0$, which is not possible since $r_{k+1} \in K_{k+1}$
and $r_{k+1} \perp K_k$. An orthogonal family of nonzero vectors is free, which implies
$K_k = \mathrm{span}\{d_0, \dots, d_k\}$.


9.5 Conjugate Gradient Algorithm 


In order to find a practical algorithm for the conjugate gradient method, we
are going to use Proposition 9.4.1, namely that the sequence $d_k = x_{k+1} - x_k$
is conjugate with respect to $A$. Thanks to the Gram-Schmidt procedure, we
will easily construct a sequence $(p_k)$ conjugate with respect to $A$ that will be
linked to the sequence $(d_k)$ by the following result of “almost uniqueness” of
the Gram-Schmidt orthogonalization procedure.


Lemma 9.5.1. Let $(a_i)_{1 \le i \le p}$ be a family of linearly independent vectors of
$\mathbb{R}^n$, and let $(b_i)_{1 \le i \le p}$ and $(c_i)_{1 \le i \le p}$ be two orthogonal families for the same
scalar product on $\mathbb{R}^n$ such that for all $1 \le i \le p$,
$$\mathrm{span}\{a_1, \dots, a_i\} = \mathrm{span}\{b_1, \dots, b_i\} = \mathrm{span}\{c_1, \dots, c_i\}.$$
Then each vector $b_i$ is parallel to $c_i$ for $1 \le i \le p$.


Proof. By the Gram-Schmidt orthonormalization procedure (cf. Theorem
2.1.1), there exists a unique orthonormal family $(d_i)_{1 \le i \le p}$ (up to a sign change)
such that
$$\mathrm{span}\{a_1, \dots, a_i\} = \mathrm{span}\{d_1, \dots, d_i\}, \quad \forall\, 1 \le i \le p.$$
Since the family $(a_i)$ is linearly independent, the vectors $b_i$ and $c_i$ are never
zero. We can thus consider the orthonormal families $(b_i/\|b_i\|)$ and $(c_i/\|c_i\|)$,
which must then coincide up to a sign change. This proves that each $b_i$ and
$c_i$ differ only by a multiplicative constant.

We will apply this result to the scalar product $\langle A\cdot, \cdot\rangle$ and to the family
$(d_k)_{0 \le k \le k_0}$, which is orthogonal with respect to this scalar product. Accordingly,
if we can produce another sequence $(p_k)_{0 \le k \le k_0}$ that would also be
orthogonal with respect to $\langle A\cdot, \cdot\rangle$ (for instance, by the Gram-Schmidt procedure),
then for all $0 \le k \le k_0$ there would exist a scalar $\alpha_k$ such that
$$x_{k+1} = x_k + \alpha_k p_k.$$
More precisely, we obtain the following fundamental result.


Theorem 9.5.1. Let $A$ be a symmetric, positive definite matrix. Let $(x_k)$ be
the sequence of approximate solutions of the conjugate gradient method. Let
$(r_k = b - Ax_k)$ be the associated residual sequence. Then there exists an
$A$-conjugate sequence $(p_k)$ such that
$$p_0 = r_0 = b - Ax_0, \quad\text{and for } 0 \le k \le k_0, \quad
\begin{cases}
x_{k+1} = x_k + \alpha_k p_k, \\
r_{k+1} = r_k - \alpha_k A p_k, \\
p_{k+1} = r_{k+1} + \beta_k p_k,
\end{cases} \tag{9.4}$$
with
$$\alpha_k = \frac{\|r_k\|^2}{\langle Ap_k, p_k\rangle} \quad\text{and}\quad \beta_k = \frac{\|r_{k+1}\|^2}{\|r_k\|^2}.$$
Conversely, let $(x_k, r_k, p_k)$ be three sequences defined by the induction relations
(9.4). Then $(x_k)$ is nothing but the sequence of approximate solutions of the
conjugate gradient method.


Proof. We start by remarking that should $(x_k)$ be the sequence of approximate
solutions of the conjugate gradient method, then for all $k \ge k_0 + 1$ we
would have $x_k = x_{k_0+1}$ and $r_k = 0$. Reciprocally, let $(x_k, r_k, p_k)$ be three
sequences defined by the induction relations (9.4). We denote by $k_0$ the smallest
index such that $r_{k_0+1} = 0$. We then easily check that for $k \ge k_0 + 1$ we have
$x_k = x_{k_0+1}$ and $r_k = p_k = 0$. As a consequence, in the sequel we restrict
indices to $k \le k_0$ for which $r_k \ne 0$ (for both methods).

Consider $(x_k)$, the sequence of approximate solutions of the conjugate
gradient method. Let us show that it satisfies the induction relations (9.4).
We construct a sequence $(p_k)$, orthogonal with respect to the scalar product
$\langle A\cdot, \cdot\rangle$, by applying the Gram-Schmidt procedure to the family $(r_k)$. We define,
in this fashion, for $0 \le k \le k_0$,
$$p_k = r_k + \sum_{j=0}^{k-1} \beta_{j,k}\, p_j, \quad\text{with}\quad \beta_{j,k} = -\frac{\langle Ar_k, p_j\rangle}{\langle Ap_j, p_j\rangle}. \tag{9.5}$$
Applying Lemma 9.5.1, we deduce that, since the sequences $(p_k)$ and $(d_k =
x_{k+1} - x_k)$ are both conjugate with respect to $A$, for all $0 \le k \le k_0$ there
exists a scalar $\alpha_k$ such that
$$x_{k+1} = x_k + \alpha_k p_k.$$
We deduce
$$\langle Ar_k, p_j\rangle = \langle r_k, Ap_j\rangle = \Big\langle r_k, A\,\frac{x_{j+1} - x_j}{\alpha_j}\Big\rangle = \frac{1}{\alpha_j}\langle r_k, r_j - r_{j+1}\rangle \tag{9.6}$$
on the one hand. On the other hand, we know that the sequence $(r_k)$ is orthogonal
(for the canonical scalar product). Therefore (9.5) and (9.6) imply


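$$\beta_{j,k} = \begin{cases} 0 & \text{if } 0 \le j \le k - 2, \\[4pt] \dfrac{\|r_k\|^2}{\alpha_{k-1}\langle Ap_{k-1}, p_{k-1}\rangle} & \text{if } j = k - 1. \end{cases}$$
In other words, we have obtained
$$p_k = r_k + \beta_{k-1} p_{k-1} \quad\text{with}\quad \beta_{k-1} = \beta_{k-1,k}.$$
Moreover, the relation $x_{k+1} = x_k + \alpha_k p_k$ implies that
$$r_{k+1} = r_k - \alpha_k A p_k.$$
We have hence derived the three induction relations (9.4). It remains to find
the value of $\alpha_k$. To do so, we use the fact that $r_{k+1}$ is orthogonal to $r_k$:
$$0 = \|r_k\|^2 - \alpha_k \langle Ap_k, r_k\rangle.$$
Now, $r_k = p_k - \beta_{k-1} p_{k-1}$ and $\langle Ap_k, p_{k-1}\rangle = 0$, and thus
$$\alpha_k = \frac{\|r_k\|^2}{\langle Ap_k, p_k\rangle} \quad\text{and}\quad \beta_{k-1} = \frac{\|r_k\|^2}{\|r_{k-1}\|^2}.$$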


Conversely, consider three sequences $(x_k, r_k, p_k)$ defined by the induction
relations (9.4). It is easy to show by induction that the relations
$$r_0 = b - Ax_0 \quad\text{and}\quad \begin{cases} r_{k+1} = r_k - \alpha_k A p_k, \\ x_{k+1} = x_k + \alpha_k p_k, \end{cases}$$
imply that the sequence $r_k$ is indeed the residual sequence, namely $r_k =
b - Ax_k$. Another elementary induction argument shows that the relations
$$r_0 = p_0 \quad\text{and}\quad \begin{cases} r_k = r_{k-1} - \alpha_{k-1} A p_{k-1}, \\ p_k = r_k + \beta_{k-1} p_{k-1}, \end{cases}$$
imply that $p_k$ and $r_k$ belong to the Krylov space $K_k$, for all $k \ge 0$. As a
result, the induction relation $x_{k+1} = x_k + \alpha_k p_k$ entails that $x_{k+1}$ belongs to
the affine space $[x_0 + K_k]$. To conclude that the sequence $(x_k)$ is indeed that
of the conjugate gradient method, it remains to show that $r_{k+1}$ is orthogonal
to $K_k$ (first definition of the conjugate gradient method). To this end, we now
prove by induction that $r_{k+1}$ is orthogonal to $r_j$, for all $0 \le j \le k$, and that
$p_{k+1}$ is conjugate to $p_j$, for all $0 \le j \le k$. At order 0 we have
$$\langle r_1, r_0\rangle = \|r_0\|^2 - \alpha_0 \langle Ap_0, r_0\rangle = 0,$$
since $p_0 = r_0$, and
$$\langle Ap_1, p_0\rangle = \langle (r_1 + \beta_0 p_0), Ap_0\rangle = \alpha_0^{-1} \langle (r_1 + \beta_0 r_0), (r_0 - r_1)\rangle = 0.$$
Assume that up to the $k$th order, we have
$$\langle r_k, r_j\rangle = 0 \text{ for } 0 \le j \le k - 1, \quad\text{and}\quad \langle Ap_k, p_j\rangle = 0 \text{ for } 0 \le j \le k - 1.$$


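Let us prove that this still holds at the $(k + 1)$th order. Multiplying the
definition of $r_{k+1}$ by $r_j$ leads to
$$\langle r_{k+1}, r_j\rangle = \langle r_k, r_j\rangle - \alpha_k \langle Ap_k, r_j\rangle,$$
which together with the relation $r_j = p_j - \beta_{j-1} p_{j-1}$ implies
$$\langle r_{k+1}, r_j\rangle = \langle r_k, r_j\rangle - \alpha_k \langle Ap_k, p_j\rangle + \alpha_k \beta_{j-1} \langle Ap_k, p_{j-1}\rangle.$$
By the induction assumption, we easily infer that $\langle r_{k+1}, r_j\rangle = 0$ if $j \le k - 1$,
whereas the formula for $\alpha_k$ implies that $\langle r_{k+1}, r_k\rangle = 0$. Moreover,
$$\langle Ap_{k+1}, p_j\rangle = \langle p_{k+1}, Ap_j\rangle = \langle r_{k+1}, Ap_j\rangle + \beta_k \langle p_k, Ap_j\rangle,$$
and since $Ap_j = (r_j - r_{j+1})/\alpha_j$ we have
$$\langle Ap_{k+1}, p_j\rangle = \alpha_j^{-1} \langle r_{k+1}, (r_j - r_{j+1})\rangle + \beta_k \langle p_k, Ap_j\rangle.$$
For $j \le k - 1$, the induction assumption and the orthogonality of $r_{k+1}$ (which
we have just obtained) prove that $\langle Ap_{k+1}, p_j\rangle = 0$. For $j = k$, we infer
$\langle Ap_{k+1}, p_k\rangle = 0$ from the formulas giving $\alpha_k$ and $\beta_k$. This ends the induction.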

Since the family $(r_k)_{0 \le k \le k_0}$ is orthogonal, it is linearly independent as long
as $r_k \ne 0$. Now, $r_k \in K_k$, which entails that $K_k = \mathrm{span}\{r_0, \dots, r_k\}$, since
these two spaces have the same dimension. Thus, we have derived the first
definition of the conjugate gradient, namely
$$x_{k+1} \in [x_0 + K_k] \quad\text{and}\quad r_{k+1} \perp K_k.$$
Hence, the sequence $x_k$ is indeed that of the conjugate gradient.

Remark 9.5.1. The main relevance of Theorem 9.5.1 is that it provides a practical
algorithm for computing the sequence of approximate solutions $(x_k)$. It
actually suffices to apply the three induction relations to $(x_k, r_k, p_k)$ to derive
$x_{k+1}$ starting from $x_k$. It is no longer required to orthogonalize $r_{k+1}$ with
respect to $K_k$, nor to minimize $\frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle$ on the space $[x_0 + K_k]$ (cf.
the two theoretical definitions of the conjugate gradient method).

Instead of having a single induction on $x_k$ (as in the simple gradient
method), there are three inductions on $(x_k, r_k, p_k)$ in the conjugate gradient
method (in truth, there are only two, since $r_k$ is just an auxiliary for the
computations we can easily get rid of). In the induction on $p_k$, if we substitute
$p_k$ with $(x_{k+1} - x_k)/\alpha_k$, we obtain a three-term induction for $x_k$:
$$x_{k+1} = x_k + \alpha_k (b - Ax_k) + \frac{\alpha_k \beta_{k-1}}{\alpha_{k-1}} (x_k - x_{k-1}).$$
This three-term induction is more complex than the simple inductions we have
studied so far in the iterative methods (see Chapter 8). This partly accounts
for the higher efficiency of the conjugate gradient method over the simple
gradient method.
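
The properties established in this section (orthogonal residuals, $A$-conjugate directions, and termination after $k_0 + 1$ iterations) can be checked on a small example. The sketch below runs the induction relations (9.4) in exact rational arithmetic, so that the scalar products vanish exactly; it is an illustration only, and the function name `cg_sequences` and the choice of test matrix are assumptions made here.

```python
from fractions import Fraction

def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Canonical scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def cg_sequences(A, b, x0):
    """Apply relations (9.4) exactly until the residual vanishes.

    Returns the final iterate and the lists (r_k) and (p_k).
    """
    x = [Fraction(v) for v in x0]
    r = [Fraction(bi) - yi for bi, yi in zip(b, matvec(A, x))]
    p = list(r)
    residuals, directions = [r], [p]
    while dot(r, r) != 0:
        Ap = matvec(A, p)
        alpha = dot(r, r) / dot(Ap, p)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * yi for ri, yi in zip(r, Ap)]
        beta = dot(r_new, r_new) / dot(r, r)
        p = [ri + beta * pi for ri, pi in zip(r_new, p)]
        r = r_new
        residuals.append(r)
        directions.append(p)
    return x, residuals, directions

# 3 x 3 Laplacian-type matrix (symmetric positive definite), b = e_1.
A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
b = [1, 0, 0]
x, residuals, directions = cg_sequences(A, b, [0, 0, 0])
iterations = len(residuals) - 1
```

For this example the Krylov critical dimension is $k_0 = 2$, so the loop stops after exactly three iterations with the exact solution $(3/4, 1/2, 1/4)$.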


9.5.1 Numerical Algorithm 


In practice, the conjugate gradient algorithm is implemented following
Algorithm 9.3.

Data: $A$ and $b$. Output: $x$ (approximation of the solution of $Ax = b$)
Initialization:
    choose $x \in \mathbb{R}^n$
    compute $r = b - Ax$
    set $p = r$
    compute $\gamma = \|r\|^2$
While $|\gamma| > \varepsilon$
    $y = Ap$
    $\alpha = \gamma / \langle y, p\rangle$
    $x = x + \alpha p$
    $r = r - \alpha y$
    $\beta = \|r\|^2 / \gamma$
    $\gamma = \|r\|^2$
    $p = r + \beta p$
End While

Algorithm 9.3: Conjugate gradient algorithm.

As soon as $r_k = 0$, the algorithm has converged. That is, $x_k$ is the
solution to the system $Ax = b$. We know that the convergence is achieved
after $k_0 + 1$ iterations, where $k_0 \le n - 1$ is the critical dimension of the Krylov
spaces (of which we have no knowledge a priori). However, in practice, computer
calculations are always prone to errors due to rounding, so we do not
find exactly $r_{k_0+1} = 0$. That is why we introduce a small parameter $\varepsilon$ (for
instance, $10^{-4}$ or $10^{-8}$ according to the desired precision), and we decide that
the algorithm has converged as soon as
$$\frac{\|r_k\|}{\|r_0\|} \le \varepsilon.$$
Moreover, for large systems (for which $n$ and $k_0$ are large, orders of magnitude
ranging from $10^4$ to $10^6$), the conjugate gradient method is used as an iterative
method, i.e., it converges in the sense of the above criterion after a number of
iterations much smaller than $k_0 + 1$ (cf. Proposition 9.5.1 below).
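
A possible transcription of Algorithm 9.3 into plain Python is sketched below. It is not the book's program: the matrix is passed as a function `apply_A` computing the product $Ap$ (an assumption made for illustration), the stopping test is the relative residual criterion $\|r_k\| / \|r_0\| \le \varepsilon$ written with squared norms, and `max_iter` is an added safeguard.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Canonical scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(apply_A, b, x0, eps=1e-10, max_iter=None):
    """Conjugate gradient iteration following the steps of Algorithm 9.3."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, apply_A(x))]
    p = list(r)
    gamma = dot(r, r)                    # gamma = ||r||^2
    stop = (eps ** 2) * gamma            # ||r_k||^2 <= eps^2 ||r_0||^2
    if max_iter is None:
        max_iter = 10 * len(b)           # safeguard (assumption)
    iterations = 0
    while gamma > stop and iterations < max_iter:
        y = apply_A(p)                   # the only matrix-vector product
        alpha = gamma / dot(y, p)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * yi for ri, yi in zip(r, y)]
        gamma_new = dot(r, r)
        beta = gamma_new / gamma
        gamma = gamma_new
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        iterations += 1
    return x, iterations

# 5 x 5 tridiagonal matrix with 2 on the diagonal and -1 next to it.
n = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x, iterations = conjugate_gradient(lambda v: matvec(A, v), b, [0.0] * n)
exact = [i * (n + 1 - i) / 2.0 for i in range(1, n + 1)]
```

Because the right-hand side here is symmetric about the middle of the vector, the Krylov spaces stay small and the iteration stops well before $n$ steps.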


Remark 9.5.2. 


1. In general, if we have no information about the solution, we choose to
initialize the conjugate gradient method with $x_0 = 0$. If we are solving
a sequence of problems that bear little difference between them, we can
initialize $x_0$ to the previous solution.

2. At each iteration, a single matrix-vector product is computed, namely
$Ap_k$, since $r_k$ is computed by the induction formula and not through the
relation $r_k = b - Ax_k$.

3. To implement the conjugate gradient method, it is not necessary to store
the matrix $A$ in an array if we know how to compute the product $Ay$
for any vector $y$. For instance, for the Laplacian matrix in one space
dimension, we have
$$A = \begin{pmatrix} 2 & -1 & & 0 \\ -1 & \ddots & \ddots & \\ & \ddots & \ddots & -1 \\ 0 & & -1 & 2 \end{pmatrix}, \qquad
\begin{aligned} &(Ay)_1 = 2y_1 - y_2, \\ &(Ay)_n = 2y_n - y_{n-1}, \\ &\text{and for } i = 2, \dots, n-1, \\ &(Ay)_i = 2y_i - y_{i+1} - y_{i-1}. \end{aligned}$$

4. The conjugate gradient method is very efficient. It has many variants and
generalizations, in particular to the case of nonsymmetric positive definite
matrices.
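
The matrix-free product described in item 3 can be written in a few lines. The sketch below, with the hypothetical name `laplacian_1d_apply`, applies the formulas for $(Ay)_i$ directly and never stores the matrix; the dense matrix is built only to compare results.

```python
def laplacian_1d_apply(y):
    """Return A y for the tridiagonal matrix with 2 on the diagonal and -1
    on the two adjacent diagonals, without storing A."""
    n = len(y)
    if n == 1:
        return [2 * y[0]]
    result = [0] * n
    result[0] = 2 * y[0] - y[1]
    result[n - 1] = 2 * y[n - 1] - y[n - 2]
    for i in range(1, n - 1):
        result[i] = 2 * y[i] - y[i + 1] - y[i - 1]
    return result

# Comparison with the dense product on a small vector.
y = [1, 2, 3, 4]
dense = [[2 if i == j else -1 if abs(i - j) == 1 else 0 for j in range(4)]
         for i in range(4)]
dense_product = [sum(a * v for a, v in zip(row, y)) for row in dense]
matrix_free_product = laplacian_1d_apply(y)
```

The matrix-free version needs of the order of $n$ operations and no storage for $A$, instead of $n^2$ for a dense product.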


9.5.2 Number of Operations 


If we consider the conjugate gradient method as a direct method, we can
count the number of operations necessary to solve a linear system in the most
unfavorable case, $k_0 = n - 1$. At each iteration, on the one hand, a matrix-vector
product is computed ($n^2$ products), and on the other hand, two scalar
products and three linear combinations of vectors such as $x + \alpha y$ (of the order
of magnitude of $n$ products) have to be carried out. After $n$ iterations, we
obtain a number of operations equal, to first order, to
$$N_{op} \approx n^3.$$
It is hence less efficient than the Gauss and Cholesky methods. But one must
recall that it is being used as an iterative method and that generally convergence
occurs after fewer than $n$ iterations.


9.5.3 Convergence Speed 


Recall that Remark 9.1.1 tells us that the gradient method reduces the error
at each iteration by a factor of $(\mathrm{cond}_2(A) - 1)/(\mathrm{cond}_2(A) + 1)$. We shall now
prove that the convergence speed of the conjugate gradient is much faster,
since it reduces the error at each iteration by a smaller factor
$(\sqrt{\mathrm{cond}_2(A)} - 1)/(\sqrt{\mathrm{cond}_2(A)} + 1)$.


Proposition 9.5.1. Let $A$ be a symmetric real and positive definite matrix.
Let $x$ be the exact solution to the system $Ax = b$. Let $(x_k)_k$ be the sequence of
approximate solutions produced by the conjugate gradient method. We have
$$\|x_k - x\|_2 \le 2\sqrt{\mathrm{cond}_2(A)} \left(\frac{\sqrt{\mathrm{cond}_2(A)} - 1}{\sqrt{\mathrm{cond}_2(A)} + 1}\right)^{k} \|x_0 - x\|_2.$$


Proof. Recall that $x_k$ can be derived as the minimum on the space $[x_0 + K_{k-1}]$
of the function $f$ defined by
$$z \in \mathbb{R}^n \mapsto f(z) = \frac{1}{2}\langle Az, z\rangle - \langle b, z\rangle = \frac{1}{2}\|x - z\|_A^2 - \frac{1}{2}\langle Ax, x\rangle,$$
where $\|y\|_A^2 = \langle Ay, y\rangle$. Computing $x_k$ is thus equivalent to minimizing
$\|x - z\|_A^2$, that is, the error in the $\|\cdot\|_A$ norm, on the space $[x_0 + K_{k-1}]$. Now we
compute $\|e_k\|_A$. Relations (9.4) show that
$$x_k = x_0 + \sum_{j=0}^{k-1} \alpha_j p_j,$$
with each $p_j$ equal to a polynomial in $A$ (of degree $\le j$) applied to $p_0$, so there
exists a polynomial $q_{k-1} \in P_{k-1}$ such that
$$x_k = x_0 + q_{k-1}(A) p_0.$$
Since $p_0 = r_0 = b - Ax_0 = A(x - x_0)$, the errors $e_k = x_k - x$ satisfy the relations
$$e_k = x_k - x = e_0 + q_{k-1}(A) p_0 = Q_k(A) e_0,$$
where $Q_k$ is the polynomial defined, for all $t \in \mathbb{R}$, by $Q_k(t) = 1 - q_{k-1}(t)\,t$. We
denote by $u_j$ an orthonormal basis of eigenvectors of $A$, i.e., $Au_j = \lambda_j u_j$, and
by $e_{0,j}$ the entries of the initial error $e_0$ in this basis: $e_0 = \sum_{j=1}^{n} e_{0,j} u_j$. We
have
$$\|e_0\|_A^2 = \langle e_0, Ae_0\rangle = \sum_{j=1}^{n} \lambda_j |e_{0,j}|^2$$
and
$$\|e_k\|_A^2 = \|Q_k(A) e_0\|_A^2 = \sum_{j=1}^{n} \lambda_j |e_{0,j} Q_k(\lambda_j)|^2.$$


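Since the conjugate gradient method minimizes $\|e_k\|_A$, the polynomial $Q_k$
must satisfy
$$\|e_k\|_A^2 = \sum_{j=1}^{n} \lambda_j |e_{0,j} Q_k(\lambda_j)|^2 = \min_{Q \in P_k^0} \sum_{j=1}^{n} \lambda_j |e_{0,j} Q(\lambda_j)|^2, \tag{9.7}$$
where minimization is carried out in $P_k^0$, the set of polynomials $Q \in P_k$ that satisfy
$Q(0) = 1$ (indeed, we have $Q_k \in P_k^0$). From (9.7) we get an easy upper bound
$$\frac{\|e_k\|_A^2}{\|e_0\|_A^2} \le \min_{Q \in P_k^0} \Big( \max_{1 \le j \le n} |Q(\lambda_j)|^2 \Big) \le \min_{Q \in P_k^0} \Big( \max_{\lambda_1 \le x \le \lambda_n} |Q(x)|^2 \Big). \tag{9.8}$$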


The last min-max problem in the right-hand side of (9.8) is a classical and
celebrated polynomial approximation problem: Find a polynomial $p \in P_k$
minimizing the quantity $\max_{x \in [a,b]} |p(x)|$. To avoid the trivial zero solution, we
impose an additional condition on $p$, for instance, $p(\beta) = 1$ for a number
$\beta \notin [a, b]$. This problem is solved at the end of this chapter, Section 9.5.5. Its
unique solution (see Proposition 9.5.3) is the polynomial
$$\frac{T_k\left(\dfrac{2x - (a + b)}{b - a}\right)}{T_k\left(\dfrac{2\beta - (a + b)}{b - a}\right)}, \tag{9.9}$$
where $T_k$ is the $k$th-degree Chebyshev polynomial defined for all $t \in [-1, 1]$
by $T_k(t) = \cos(k \arccos t)$. The maximum value reached by (9.9) on the interval
$[a, b]$ is
$$\frac{1}{\left| T_k\left(\dfrac{2\beta - (a + b)}{b - a}\right) \right|}.$$
In our case (9.7), we have $a = \lambda_1$, $b = \lambda_n$, $\beta = 0$. The matrix $A$ being positive
definite, $\beta \notin [a, b]$, we conclude that
$$\min_{Q \in P_k^0} \max_{\lambda_1 \le x \le \lambda_n} |Q(x)| = \frac{1}{\left| T_k\left(\dfrac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right) \right|}. \tag{9.10}$$

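Observe now that

• $A$ being symmetric, we have $\dfrac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} = \dfrac{\kappa + 1}{\kappa - 1}$, with $\kappa = \mathrm{cond}_2(A) = \lambda_n/\lambda_1$;

• Chebyshev polynomials satisfy (9.16), which we recall here:
$$2T_k(x) = \big[x + \sqrt{x^2 - 1}\big]^k + \big[x - \sqrt{x^2 - 1}\big]^k,$$
for all $x \in (-\infty, -1] \cup [1, +\infty)$. Therefore, we infer the following lower
bound:
$$\left(2T_k\Big(\frac{\kappa + 1}{\kappa - 1}\Big)\right)^{1/k} \ge \frac{\kappa + 1}{\kappa - 1} + \sqrt{\Big(\frac{\kappa + 1}{\kappa - 1}\Big)^2 - 1} \ge \frac{1}{\kappa - 1}\big((\kappa + 1) + 2\sqrt{\kappa}\big) = \frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}.$$

Combining these results, equality (9.10), and the upper bound (9.8), we obtain
$$\|e_k\|_A^2 \le 4 \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2k} \|e_0\|_A^2.$$
Finally, the proposition follows immediately from the equivalence between the
norms $\|\cdot\|_2$ and $\|\cdot\|_A$: $\lambda_1 \|x\|_2^2 \le \|x\|_A^2 \le \lambda_n \|x\|_2^2$.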


It is possible to improve further Proposition 9.5.1 by showing that the 
convergence speed is indeed faster when the eigenvalues of A are confined to 
a reduced number of values. 

We draw from Proposition 9.5.1 three important consequences. First of 
all, the conjugate gradient method works indeed as an iterative method. In 
fact, even if we do not perform the requisite n iterations to converge, the more 
we iterate, the more the error between $x$ and $x_k$ diminishes. Furthermore, the
convergence speed depends on the square root of the conditioning of A, and 
not on the conditioning itself as for the simple gradient method. Hence, the 
conjugate gradient method converges faster than the simple gradient one (we 
say that the convergence is quadratic instead of linear). Lastly, as usual, the 
closer cond2(A) is to 1, the greater the speed of convergence, which means 
that the matrix A has to be well conditioned for a quick convergence of the 
conjugate gradient method. 
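
To get a feeling for the difference between the two reduction factors, the short sketch below evaluates them for a given condition number and estimates how many iterations each factor needs to reduce the error by a prescribed amount. It is only an order-of-magnitude illustration: the constant $2\sqrt{\mathrm{cond}_2(A)}$ in Proposition 9.5.1 is ignored, and the function names are hypothetical.

```python
import math

def gradient_factor(cond):
    """Per-iteration reduction factor of the simple gradient method."""
    return (cond - 1.0) / (cond + 1.0)

def conjugate_gradient_factor(cond):
    """Per-iteration reduction factor appearing in Proposition 9.5.1."""
    s = math.sqrt(cond)
    return (s - 1.0) / (s + 1.0)

def iterations_for(factor, reduction):
    """Smallest k such that factor**k <= reduction, for 0 < factor < 1."""
    return math.ceil(math.log(reduction) / math.log(factor))

cond = 100.0
k_gradient = iterations_for(gradient_factor(cond), 1e-6)
k_conjugate = iterations_for(conjugate_gradient_factor(cond), 1e-6)
```

With a condition number of 100, the first count is several hundred while the second is a few dozen, which reflects the dependence on $\sqrt{\mathrm{cond}_2(A)}$ rather than on $\mathrm{cond}_2(A)$.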


9.5.4 Preconditioning 


Definition 9.5.1. Let $Ax = b$ be the linear system to be solved. We call a
matrix $C$ that is easy to invert and such that $\mathrm{cond}_2(C^{-1}A)$ is smaller than
$\mathrm{cond}_2(A)$ a preconditioning of $A$. We call the equivalent system $C^{-1}Ax =
C^{-1}b$ a preconditioned system.


The idea of preconditioning is that if $\mathrm{cond}_2(C^{-1}A) < \mathrm{cond}_2(A)$, the conjugate
gradient method converges faster for the preconditioned system than
for the original one. Of course, the price to pay for this faster convergence is
the requirement of inverting $C$. Nevertheless, we recall that it is not necessary
to form the matrix $C^{-1}A$; we merely successively multiply matrices $A$ and
$C^{-1}$ by vectors. The problem here is to choose a matrix $C$ that is an easily
invertible approximation of $A$. In practice, the preconditioning technique is
very efficient and there is a large literature on this topic.

Observe that it is not obvious whether we can choose $C$ in such a way that
on the one hand, $\mathrm{cond}(C^{-1}A) < \mathrm{cond}(A)$, and on the other hand, $C^{-1}A$ remains
symmetric positive definite (a necessary condition for the application
of the conjugate gradient method). For this reason, we introduce a “symmetric”
preconditioning. Let $C$ be a symmetric positive definite matrix. We write
$C = BB^t$ (Cholesky decomposition) and we replace the original system
$Ax = b$, whose matrix $A$ is badly conditioned, by the equivalent system
$$\tilde{A}\tilde{x} = \tilde{b}, \quad\text{where}\quad \tilde{A} = B^{-1}AB^{-t}, \quad \tilde{b} = B^{-1}b, \quad\text{and}\quad \tilde{x} = B^t x.$$


Since the matrix $\tilde{A}$ is symmetric positive definite, we can use Algorithm 9.3
of the conjugate gradient method to solve this problem, i.e.,

Initialization
    choice of $\tilde{x}_0$
    $\tilde{r}_0 = \tilde{p}_0 = \tilde{b} - \tilde{A}\tilde{x}_0$
Iterations
For $k \ge 1$ and $\tilde{r}_k \ne 0$
    $\tilde{\alpha}_{k-1} = \|\tilde{r}_{k-1}\|^2 / \langle \tilde{A}\tilde{p}_{k-1}, \tilde{p}_{k-1}\rangle$
    $\tilde{x}_k = \tilde{x}_{k-1} + \tilde{\alpha}_{k-1}\tilde{p}_{k-1}$
    $\tilde{r}_k = \tilde{r}_{k-1} - \tilde{\alpha}_{k-1}\tilde{A}\tilde{p}_{k-1}$
    $\tilde{\beta}_{k-1} = \|\tilde{r}_k\|^2 / \|\tilde{r}_{k-1}\|^2$
    $\tilde{p}_k = \tilde{r}_k + \tilde{\beta}_{k-1}\tilde{p}_{k-1}$
End

We can compute $\tilde{x}$ in this fashion and next compute $x$ by solving the upper
triangular system $B^t x = \tilde{x}$. In practice, we do not proceed this way in order
to cut down the cost of the computations.
Note that residuals are linked by the relation $\tilde{r}_k = \tilde{b} - \tilde{A}\tilde{x}_k = B^{-1}r_k$, the
new conjugate directions are $\tilde{p}_k = B^t p_k$, and
$$\langle \tilde{A}\tilde{p}_k, \tilde{p}_j\rangle = \langle Ap_k, p_j\rangle.$$
We thereby obtain the relations
$$\tilde{\alpha}_{k-1} = \frac{\langle C^{-1}r_{k-1}, r_{k-1}\rangle}{\langle Ap_{k-1}, p_{k-1}\rangle}, \qquad \tilde{\beta}_{k-1} = \frac{\langle C^{-1}r_k, r_k\rangle}{\langle C^{-1}r_{k-1}, r_{k-1}\rangle}.$$
We thus see that the only new operation (which is also the most costly one) is
the computation of $z_k = C^{-1}r_k$, namely solving the linear system $Cz_k = r_k$.
Knowing the Cholesky decomposition of $C$, the calculation of each $z_k$ can
be carried out in $n^2$ operations at most. From this observation, we infer a
preconditioned version of the conjugate gradient algorithm (see Algorithm
9.4), in which the precise knowledge of the Cholesky factorization of $C$ is no
longer required. This is Algorithm 9.4, which is used in numerical practice.


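Data: $A$ and $b$. Output: $x_k$
Initialization
    choice of $x_0$
    $r_0 = b - Ax_0$
    $z_0 = C^{-1}r_0$    (preconditioning)
    $p_0 = z_0$
Iterations
For $k \ge 1$ and $r_k \ne 0$
    $\alpha_{k-1} = \langle z_{k-1}, r_{k-1}\rangle / \langle Ap_{k-1}, p_{k-1}\rangle$
    $x_k = x_{k-1} + \alpha_{k-1}p_{k-1}$
    $r_k = r_{k-1} - \alpha_{k-1}Ap_{k-1}$
    $z_k = C^{-1}r_k$    (preconditioning)
    $\beta_{k-1} = \langle z_k, r_k\rangle / \langle z_{k-1}, r_{k-1}\rangle$
    $p_k = z_k + \beta_{k-1}p_{k-1}$
End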


Algorithm 9.4: Preconditioned conjugate gradient method. 
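
The structure of Algorithm 9.4 can be sketched in plain Python as follows. This is an illustration under assumptions: the preconditioner is supplied as a function `solve_C` returning $C^{-1}r$, the example uses the diagonal of $A$ as $C$ (a simple choice made here, not the SSOR preconditioner), and the stopping test on the relative residual and the `max_iter` safeguard are additions.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Canonical scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def preconditioned_cg(apply_A, solve_C, b, x0, eps=1e-10, max_iter=None):
    """Preconditioned conjugate gradient, following Algorithm 9.4."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, apply_A(x))]
    z = solve_C(r)                        # preconditioning: z = C^{-1} r
    p = list(z)
    rho = dot(z, r)
    norm_r0 = math.sqrt(dot(r, r))
    if max_iter is None:
        max_iter = 10 * len(b)            # safeguard (assumption)
    iterations = 0
    while math.sqrt(dot(r, r)) > eps * norm_r0 and iterations < max_iter:
        Ap = apply_A(p)
        alpha = rho / dot(Ap, p)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * yi for ri, yi in zip(r, Ap)]
        z = solve_C(r)                    # preconditioning: z = C^{-1} r
        rho_new = dot(z, r)
        beta = rho_new / rho
        rho = rho_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        iterations += 1
    return x, iterations

# Example with C = diag(A), chosen here only for illustration.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]                      # so that the solution is (1, 2, 3)
diagonal = [A[i][i] for i in range(3)]
x, iterations = preconditioned_cg(
    lambda v: matvec(A, v),
    lambda r: [ri / di for ri, di in zip(r, diagonal)],
    b, [0.0, 0.0, 0.0])
```

Only products by $A$ and solves with $C$ appear; the Cholesky factor $B$ of $C$ is never needed explicitly.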


SSOR preconditioning 


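To illustrate the interest in preconditioning the conjugate gradient method,
we consider again the matrix given by the finite difference discretization of the
Laplacian, which is a tridiagonal, symmetric, positive definite matrix $A_n \in
\mathcal{M}_{n-1}(\mathbb{R})$, defined by (5.12). For simplicity, in the sequel we drop the subscript
$n$ and write $A$ instead of $A_n$. We denote by $D$ the diagonal of $A$ and by $-E$ its
strictly lower triangular part, $A = D - E - E^t$. For $\omega \in (0, 2)$, the symmetric
matrix
$$C_\omega = \frac{\omega}{2 - \omega}\Big(\frac{D}{\omega} - E\Big) D^{-1} \Big(\frac{D}{\omega} - E^t\Big)$$
is positive definite. Indeed, for $x \ne 0$, we have
$$\langle C_\omega x, x\rangle = \frac{\omega}{2 - \omega}\Big\langle D^{-1}\Big(\frac{D}{\omega} - E^t\Big)x, \Big(\frac{D}{\omega} - E^t\Big)x\Big\rangle,$$
and since $D^{-1}$ is symmetric positive definite and $\frac{D}{\omega} - E^t$ is nonsingular, we
have $\langle C_\omega x, x\rangle > 0$ if and only if $\frac{\omega}{2 - \omega} > 0$. We now precondition the system
$Ax = b$ by the matrix $C_\omega$. We denote by $B_\omega = \sqrt{\frac{\omega}{2 - \omega}}\,\big(\frac{D}{\omega} - E\big)D^{-1/2}$ the
Cholesky factor of $C_\omega$. The matrices $\tilde{A}_\omega = B_\omega^{-1}AB_\omega^{-t}$ and $C_\omega^{-1}A$ are similar,
since
$$C_\omega^{-1}A = B_\omega^{-t}B_\omega^{-1}A = B_\omega^{-t}\big(B_\omega^{-1}AB_\omega^{-t}\big)B_\omega^{t} = B_\omega^{-t}\tilde{A}_\omega B_\omega^{t}.$$
Hence
$$\mathrm{cond}_2(C_\omega^{-1}A) = \mathrm{cond}_2(\tilde{A}_\omega) = \frac{\lambda_{\max}(\tilde{A}_\omega)}{\lambda_{\min}(\tilde{A}_\omega)}.$$
To evaluate the performance of the preconditioning we determine an upper
bound for $\lambda_{\max}(\tilde{A}_\omega)$ and a lower bound for $\lambda_{\min}(\tilde{A}_\omega)$. For any $x \ne 0$, we have
$$\frac{\langle \tilde{A}_\omega x, x\rangle}{\langle x, x\rangle} = \frac{\langle B_\omega^{-1}AB_\omega^{-t}x, x\rangle}{\langle x, x\rangle} = \frac{\langle AB_\omega^{-t}x, B_\omega^{-t}x\rangle}{\langle x, x\rangle}.$$
Setting $x = B_\omega^t y$, we obtain
$$\lambda_{\max}(\tilde{A}_\omega) = \max_{x \ne 0} \frac{\langle \tilde{A}_\omega x, x\rangle}{\langle x, x\rangle} = \max_{y \ne 0} \frac{\langle Ay, y\rangle}{\langle C_\omega y, y\rangle}.$$
Similarly,
$$\lambda_{\min}(\tilde{A}_\omega) = \min_{x \ne 0} \frac{\langle Ax, x\rangle}{\langle C_\omega x, x\rangle}.$$
The goal is now to obtain the inequalities
$$0 < \alpha \le \frac{\langle Ax, x\rangle}{\langle C_\omega x, x\rangle} \le \beta, \quad \forall x \ne 0, \tag{9.11}$$
from which we will deduce the bound $\mathrm{cond}_2(\tilde{A}_\omega) \le \beta/\alpha$.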


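• To obtain the upper bound in (9.11) we decompose $C_\omega$ as
$$C_\omega = A + \frac{\omega}{2 - \omega} F_\omega D^{-1} F_\omega^t,$$
with $F_\omega = \frac{\omega - 1}{\omega} D - E$. For all $x \ne 0$, we have
$$\frac{2 - \omega}{\omega} \langle (A - C_\omega)x, x\rangle = -\langle F_\omega D^{-1} F_\omega^t x, x\rangle = -\langle D^{-1} F_\omega^t x, F_\omega^t x\rangle \le 0,$$
since $D^{-1}$ is positive definite. We can thus choose $\beta = 1$.

• To obtain the lower bound in (9.11), we write $(2 - \omega)C_\omega = A + aD + \omega G$
with
$$G = ED^{-1}E^t - \frac{1}{4}D \quad\text{and}\quad a = \frac{(2 - \omega)^2}{4\omega}.$$
For $x \ne 0$, we compute
$$(2 - \omega)\frac{\langle C_\omega x, x\rangle}{\langle Ax, x\rangle} = 1 + a\frac{\langle Dx, x\rangle}{\langle Ax, x\rangle} + \omega\frac{\langle Gx, x\rangle}{\langle Ax, x\rangle}.$$
Since $\langle Gx, x\rangle = -\frac{n^2}{2}|x_1|^2$ we have
$$(2 - \omega)\frac{\langle C_\omega x, x\rangle}{\langle Ax, x\rangle} \le 1 + a\frac{\langle Dx, x\rangle}{\langle Ax, x\rangle} = 1 + 2an^2\frac{\|x\|^2}{\langle Ax, x\rangle} \le 1 + \frac{2an^2}{\lambda_{\min}(A)}.$$
We can therefore take
$$\alpha = (2 - \omega)\,\frac{1}{1 + \dfrac{2an^2}{\lambda_{\min}(A)}}.$$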


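We now choose a value of $\omega$ that minimizes the function
$$f(\omega) = \frac{\beta}{\alpha} = \frac{1}{2 - \omega}\Big(1 + \frac{2an^2}{\lambda_{\min}(A)}\Big) = \frac{1}{2 - \omega} + \gamma\,\frac{2 - \omega}{\omega},$$
where
$$\gamma = \frac{n^2}{2\lambda_{\min}(A)} = \frac{1}{8\sin^2\frac{\pi}{2n}} \approx \frac{n^2}{2\pi^2}.$$
A simple computation shows that
$$\omega^2 (2 - \omega)^2 f'(\omega) = \omega^2 - 2\gamma (2 - \omega)^2,$$
so the value $\omega_{opt}$ minimizing $f$ is
$$\omega_{opt} = \frac{2\sqrt{2\gamma}}{1 + \sqrt{2\gamma}} = \frac{2}{1 + 2\sin\frac{\pi}{2n}} \approx 2\Big(1 - \frac{\pi}{n}\Big).$$
For this optimal value of $\omega$ the conditioning of $\tilde{A}_\omega$ is bounded above as follows:
$$\mathrm{cond}_2(\tilde{A}_\omega) \le f(\omega_{opt}) = \frac{1}{2} + \frac{1}{2\sin\frac{\pi}{2n}}.$$
Thus for $n$ large enough, $\mathrm{cond}_2(\tilde{A}_\omega) \le \frac{2n}{\pi}$, whereas $\mathrm{cond}_2(A) \approx \frac{4n^2}{\pi^2}$. We save
one order of magnitude in $n$ by preconditioning the initial linear system.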
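
The formulas of this subsection are easy to evaluate numerically. The sketch below computes $\omega_{opt}$, the bound $f(\omega_{opt})$, and the condition number of the unpreconditioned matrix for a given $n$. It is an illustration under one assumption not stated in this section: the eigenvalues of $A$ are taken to be $4n^2\sin^2\frac{k\pi}{2n}$, $k = 1, \dots, n-1$, the standard values for this finite difference matrix. The function names are hypothetical.

```python
import math

def gamma_of(n):
    """gamma = n^2 / (2 lambda_min(A)) = 1 / (8 sin^2(pi / (2n)))."""
    return 1.0 / (8.0 * math.sin(math.pi / (2 * n)) ** 2)

def f(omega, n):
    """Upper bound beta / alpha on the preconditioned condition number."""
    return 1.0 / (2.0 - omega) + gamma_of(n) * (2.0 - omega) / omega

def ssor_optimal_omega(n):
    """Value of omega minimizing f."""
    return 2.0 / (1.0 + 2.0 * math.sin(math.pi / (2 * n)))

def ssor_condition_bound(n):
    """Closed form of f(omega_opt)."""
    return 0.5 + 1.0 / (2.0 * math.sin(math.pi / (2 * n)))

def laplacian_condition_number(n):
    """cond_2(A) = lambda_max / lambda_min (assumed eigenvalue formula)."""
    s_min = math.sin(math.pi / (2 * n))
    s_max = math.sin((n - 1) * math.pi / (2 * n))
    return (s_max / s_min) ** 2

n = 50
omega_opt = ssor_optimal_omega(n)
bound = ssor_condition_bound(n)
cond_A = laplacian_condition_number(n)
```

For $n = 50$ the bound on the preconditioned condition number is of the order of $n/\pi$, while $\mathrm{cond}_2(A)$ is of the order of $4n^2/\pi^2$, in line with the gain of one order of magnitude in $n$.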


9.5.5 Chebyshev Polynomials 


This section is devoted to some properties of Chebyshev polynomials used pre- 
viously, in particular in the proof of Proposition 9.5.1. Chebyshev polynomials 
are first defined on the interval [—1, 1], then extended to R. 


Study on $[-1, 1]$

On this interval, the Chebyshev polynomial $T_n$ ($n \ge 1$) is given by
$$T_n(t) = \cos(n\theta),$$
where $\theta \in [0, \pi]$ is defined by $\cos(\theta) = t$. We easily check that $T_0(t) = 1$,
$T_1(t) = t$, and
$$T_{n+1}(t) = 2tT_n(t) - T_{n-1}(t), \quad \forall n \ge 1, \tag{9.12}$$
which proves that $T_n$ is a polynomial of degree $n$ and can be defined on the
whole of $\mathbb{R}$. We also check that $T_n$ has exactly

• $n$ zeros (called Chebyshev points), which are
$$t_k = \cos(\theta_k), \quad \theta_k = \frac{\pi}{2n} + k\frac{\pi}{n}, \quad 0 \le k \le n - 1,$$

• $n + 1$ extremal points where the polynomial takes its extremal values ($-1$
and $1$), which are
$$\tilde{t}_k = \cos(k\pi/n), \quad 0 \le k \le n.$$
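
The recurrence (9.12) gives a direct way to evaluate $T_n$, both inside and outside $[-1, 1]$. The sketch below evaluates $T_n$ by the recurrence and lists the zeros and extremal points given above; the function names are chosen here for illustration.

```python
import math

def chebyshev(n, t):
    """Evaluate T_n(t) with the three-term recurrence (9.12)."""
    previous, current = 1.0, t            # T_0(t) and T_1(t)
    if n == 0:
        return previous
    for _ in range(n - 1):
        previous, current = current, 2.0 * t * current - previous
    return current

def chebyshev_zeros(n):
    """The n zeros cos(pi / (2n) + k pi / n), k = 0, ..., n - 1."""
    return [math.cos(math.pi / (2 * n) + k * math.pi / n) for k in range(n)]

def chebyshev_extrema(n):
    """The n + 1 extremal points cos(k pi / n), k = 0, ..., n."""
    return [math.cos(k * math.pi / n) for k in range(n + 1)]

# On [-1, 1] the recurrence agrees with cos(n arccos t).
value_recurrence = chebyshev(5, 0.3)
value_cosine = math.cos(5 * math.acos(0.3))

# Outside [-1, 1] the recurrence still applies, e.g. T_3(2) = 4*8 - 3*2.
value_outside = chebyshev(3, 2.0)
```

The same function can therefore be used to evaluate quantities such as $T_k\big((\kappa + 1)/(\kappa - 1)\big)$, whose argument lies outside $[-1, 1]$.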


The first five Chebyshev polynomials are represented in Figure 9.3.

Fig. 9.3. Chebyshev polynomials $T_n$, for $n = 0, \dots, 4$.

Expressing the cosine in exponential form, we have
$$2T_n(t) = e^{in\theta} + e^{-in\theta} = (\cos\theta + i\sin\theta)^n + (\cos\theta - i\sin\theta)^n.$$
Since $\theta \in [0, \pi]$, we have $\sin\theta = \sqrt{1 - t^2}$, and $T_n$ can be written explicitly in
terms of $t$:
$$2T_n(t) = \big(t + i\sqrt{1 - t^2}\big)^n + \big(t - i\sqrt{1 - t^2}\big)^n. \tag{9.13}$$


For $\alpha \in \mathbb{R}$ with $|\alpha| > 1$, we denote by $P_n^\alpha$ the set of polynomials of $P_n$ taking
the value 1 at point $\alpha$:
$$P_n^\alpha = \{p \in P_n,\ p(\alpha) = 1\}.$$

Proposition 9.5.2. Chebyshev polynomials satisfy the following property:
$$\min_{p \in P_n^\alpha} \max_{t \in [-1,1]} |p(t)| = \max_{t \in [-1,1]} |\bar{T}_n(t)| = \frac{1}{|T_n(\alpha)|}, \tag{9.14}$$
where we have set $\bar{T}_n = T_n/T_n(\alpha)$.

Proof. We know that $T_n(\alpha) \ne 0$, because all the zeros, or roots, of $T_n$ belong
to $]-1, 1[$. Hence $\bar{T}_n$ is well defined. On the one hand, if $\alpha > 1$, then $T_n(\alpha) > 0$,
since the polynomials $T_n$ do not change sign on $]1, +\infty[$ and $T_n(1) = 1$. On
the other hand, if $\alpha < -1$, then $T_n(\alpha)$ has the same sign as $T_n(-1) = (-1)^n$.
We shall therefore assume in the sequel, without loss of generality, that $\alpha > 1$,
so that $T_n(\alpha) > 0$. To prove (9.14), we assume, to the contrary, that there
exists $p \in P_n^\alpha$ such that
$$\max_{t \in [-1,1]} |p(t)| < \frac{1}{T_n(\alpha)}. \tag{9.15}$$
We are going to show that $p = \bar{T}_n$, which contradicts our assumption (9.15).
At each of the $n + 1$ extremal points $\tilde{t}_k$ of $T_n$, the polynomial $q = p - \bar{T}_n$
changes sign, since

• $q(\tilde{t}_0) = p(\tilde{t}_0) - \frac{1}{T_n(\alpha)} < 0$,

• $q(\tilde{t}_1) = p(\tilde{t}_1) + \frac{1}{T_n(\alpha)} > 0$,

• etc.

In addition, $q(\alpha) = 0$, so the polynomial $q \in P_n$ has at least $n + 1$ distinct
zeros, because the $\tilde{t}_k$ are distinct and $\alpha \notin [-1, 1]$. It must therefore vanish
identically.


Study on [a,b] 


The induction relation (9.12) shows that the Chebyshev polynomials can be defined on the whole real line $\mathbb{R}$. We can also define $T_n(x)$ for $x \in (-\infty, -1] \cup [1, +\infty)$ as follows: since $|x| \ge 1$, we have $\sqrt{1 - x^2} = i\sqrt{x^2 - 1}$, and by (9.13),
$$2T_n(x) = \left[x + \sqrt{x^2 - 1}\right]^n + \left[x - \sqrt{x^2 - 1}\right]^n. \tag{9.16}$$

To obtain the analogue of property (9.14) on any interval $[a, b]$, we define the linear function $\varphi$ that maps $[-1, 1]$ onto $[a, b]$:
$$\varphi : [-1, 1] \to [a, b], \quad t \mapsto x = \frac{a + b}{2} + \frac{b - a}{2}\, t.$$


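Proposition 9.5.3. For any $\beta \notin [a, b]$, the Chebyshev polynomials satisfy
$$\min_{q \in \mathbb{P}_n^\beta} \; \max_{x \in [a,b]} |q(x)| = \max_{x \in [a,b]} \frac{|T_n(\varphi^{-1}(x))|}{|T_n(\varphi^{-1}(\beta))|} = \frac{1}{|T_n(\varphi^{-1}(\beta))|}.$$

Proof. Since $\beta \notin [a, b]$, $\alpha = \varphi^{-1}(\beta) \notin [-1, 1]$, and the polynomial $p(t) = q(\varphi(t))$ belongs to $\mathbb{P}_n^\alpha$. By Proposition 9.5.2 we get
$$\min_{q \in \mathbb{P}_n^\beta} \max_{x \in [a,b]} |q(x)| = \min_{q \in \mathbb{P}_n^\beta} \max_{t \in [-1,1]} |q(\varphi(t))| = \min_{p \in \mathbb{P}_n^\alpha} \max_{t \in [-1,1]} |p(t)| = \max_{t \in [-1,1]} |\bar T_n(t)| = \frac{1}{|T_n(\varphi^{-1}(\beta))|},$$
with $\bar T_n(t) = T_n(t)/T_n(\varphi^{-1}(\beta))$, and the result is proved.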
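As an illustration of Proposition 9.5.3, the optimal value $1/|T_n(\varphi^{-1}(\beta))|$ can be evaluated directly from the closed form (9.16). The sketch below is in Python; the names `chebyshev_outside` and `minimax_value` are hypothetical helpers introduced only for this example.

```python
import math

def chebyshev_outside(n, x):
    """T_n(x) for |x| >= 1, computed from the closed form (9.16)."""
    s = math.sqrt(x * x - 1.0)
    return 0.5 * ((x + s) ** n + (x - s) ** n)

def minimax_value(n, a, b, beta):
    """Value 1 / |T_n(phi^{-1}(beta))| of Proposition 9.5.3.

    beta must lie outside [a, b], so that phi^{-1}(beta) lies outside [-1, 1].
    """
    alpha = (2.0 * beta - (a + b)) / (b - a)   # phi^{-1}(beta)
    return 1.0 / abs(chebyshev_outside(n, alpha))

# Interval [1, 3] and beta = 0: phi^{-1}(0) = -2 and T_2(-2) = 2*4 - 1 = 7.
value = minimax_value(2, 1.0, 3.0, 0.0)
```

For a fixed interval and a fixed $\beta$, the returned value decreases geometrically as $n$ grows, since $|T_n|$ grows geometrically outside $[-1, 1]$.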


9.6 Exercises 


In these exercises $A \in \mathcal{M}_n(\mathbb{R})$ is always a symmetric positive definite matrix. We recall that the following two problems are equivalent:
$$\min_{x \in \mathbb{R}^n} \left\{ f(x) = \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle \right\}, \tag{9.17}$$
and solve the linear system
$$Ax = b. \tag{9.18}$$


9.1. The goal of this exercise is to program and study the constant step gradient algorithm.

1. Write a program GradientS that computes the minimizer of (9.17) by the gradient method; see Algorithm 9.1.

2. Take n = 10, A=Laplacian1dD(n); xx=(1:n)'/(n+1); b=xx.*sin(xx). Compute the solution $x_\alpha$ of (9.18) obtained using this algorithm. Take $\alpha = 10^{-4}$ and limit the number of iterations to $N_{iter} = 10000$. The convergence criterion shall be written in terms of the residual norm, which must be smaller than $\varepsilon = 10^{-4}$ times its initial value. How many iterations are necessary before convergence? Compare the obtained solution with that given by Matlab.

3. Take now $N_{iter} = 2000$ and $\varepsilon = 10^{-10}$. For $\alpha$ varying from $\alpha_{\min} = 32 \times 10^{-4}$ to $\alpha_{\max} = 42 \times 10^{-4}$, by steps of size $10^{-5}$, plot a curve representing the number of iterations necessary to compute $x_\alpha$. Determine numerically the value of $\alpha$ leading to the minimal number of iterations. Compare with the optimal value given by Theorem 9.1.1.


9.2. In order to improve the convergence of the gradient algorithm, we now program and study the variable step gradient algorithm. At each iteration, the step $\alpha$ is chosen equal to the value $\alpha_k$ that minimizes the norm of the residual $r_{k+1} = b - Ax_{k+1}$. In other words, $\alpha_k$ is defined by
$$\|\nabla f(x_k - \alpha_k \nabla f(x_k))\| = \inf_{\alpha \in \mathbb{R}} \|\nabla f(x_k - \alpha \nabla f(x_k))\|,$$
where $\|\cdot\|$ denotes the Euclidean norm $\|\cdot\|_2$.

1. Find an explicit formula for $\alpha_k$.

2. Write a program GradientV that computes the solution of (9.17) by the variable step gradient method; see Algorithm 9.2.

3. Compare both algorithms (constant and variable step) and in particular the number of iterations and the total computational time for the same numerical accuracy. For this, take the data of Exercise 9.1 with, for the constant step gradient, $\alpha$ as the optimal value.
9.3 (∗). We now program and study the conjugate gradient algorithm.


1. Write a program GradientC that computes the solution of (9.17) using 
the conjugate gradient method (see Algorithm 9.3). 

2. Compare this algorithm with the variable step gradient algorithm for the 
matrix defined in Exercise 9.1. 

3. Let A and b be defined as follows:

[The entries of the 5 × 5 matrix A and of the vector b are not recoverable from this copy.]

Solve the system $Ax = b$ using the program GradientC with the initial data $x_0 = (-2, 0, 0, 0, 10)^t$ and note down the number of iterations performed. Explain the result using the above script. Same question for $x_0 = (-1, 6, 12, 0, 17)^t$.

4. We no longer assume that the matrix A is positive definite, nor that it is 
symmetric, but only that it is nonsingular. Suggest a way of symmetrizing 
the linear system (9.18) so that one can apply the conjugate gradient 
method. 


9.4. Write a program GradientCP that computes the solution of (9.17) by the preconditioned conjugate gradient algorithm (see Algorithm 9.4) with the SSOR preconditioning (see Section 9.5.4). Compare the programs GradientC and GradientCP by plotting on the same graph the errors of both methods in terms of the iteration number. Take n = 50 and $N_{iter} = 50$ iterations without a termination criterion on the residual and consider that the exact solution is A\b.
Methods for Computing Eigenvalues


10.1 Generalities 


For a square matrix $A \in \mathcal{M}_n(\mathbb{C})$, the problem that consists in finding the solutions $\lambda \in \mathbb{C}$ and nonzero $x \in \mathbb{C}^n$ of the algebraic equation
$$Ax = \lambda x \tag{10.1}$$
is called an eigenvalue problem. The scalar $\lambda$ is called an “eigenvalue,” and the vector $x$ is an “eigenvector.” If we consider real matrices $A \in \mathcal{M}_n(\mathbb{R})$, we can also study the eigenvalue problem (10.1), but the eigenvalue $\lambda$ and eigenvector $x$ may not be real and, in full generality, can belong to $\mathbb{C}$ or $\mathbb{C}^n$ respectively.

In Section 2.3 the existence and theoretical characterization of solutions 
of (10.1) were discussed. On the other hand, the present chapter is devoted 
to some numerical algorithms for solving (10.1) in practice. 

We recall that the eigenvalues of a matrix A are the roots of its characteristic polynomial $\det(A - \lambda I_n)$. In order to compute these eigenvalues, one may naively think that it suffices to factorize its characteristic polynomial. Such a strategy is bound to fail. On the one hand, there is no explicit formula (using elementary operations such as addition, multiplication, and extraction of roots) for the zeros of most polynomials of degree greater than or equal to 5, as has been known since the work of Galois and Abel. On the other hand, there are no robust and efficient numerical algorithms for computing all the roots of large-degree polynomials. By the way, there is no special property of the characteristic polynomial, since any nth-degree polynomial of the type
$$p_A(\lambda) = (-1)^n \left(\lambda^n + a_1\lambda^{n-1} + a_2\lambda^{n-2} + \cdots + a_{n-1}\lambda + a_n\right)$$
is actually the characteristic polynomial (developed with respect to the last column) of the matrix (called the companion matrix of the polynomial)
$$A = \begin{pmatrix}
-a_1 & -a_2 & \cdots & \cdots & -a_n\\
1 & 0 & \cdots & \cdots & 0\\
0 & \ddots & \ddots & & \vdots\\
\vdots & \ddots & \ddots & \ddots & \vdots\\
0 & \cdots & 0 & 1 & 0
\end{pmatrix}.$$
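The link between a polynomial and its companion matrix can be checked on a small case: if $r$ is a root of $\lambda^n + a_1\lambda^{n-1} + \cdots + a_n$, then $(r^{n-1}, \ldots, r, 1)^t$ is an eigenvector of the companion matrix for the eigenvalue $r$. The Python sketch below verifies this for a cubic with known roots; `companion` and `matvec` are illustrative helper names.

```python
def companion(coeffs):
    """Companion matrix of lambda^n + a_1 lambda^(n-1) + ... + a_n.

    coeffs = [a_1, ..., a_n]; the first row is (-a_1, ..., -a_n) and the
    subdiagonal is filled with ones.
    """
    n = len(coeffs)
    rows = [[-a for a in coeffs]]
    for i in range(1, n):
        rows.append([1.0 if j == i - 1 else 0.0 for j in range(n)])
    return rows

def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

# (lambda - 1)(lambda - 2)(lambda - 3) = lambda^3 - 6 lambda^2 + 11 lambda - 6
A = companion([-6.0, 11.0, -6.0])
for root in (1.0, 2.0, 3.0):
    v = [root ** 2, root, 1.0]            # candidate eigenvector
    Av = matvec(A, v)
    assert all(abs(x - root * y) < 1e-12 for x, y in zip(Av, v))
```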


Accordingly, there cannot exist direct methods (that is, methods producing 
the result after a finite number of operations) for the determination of the 
eigenvalues! As a result, there are only iterative methods for computing eigen- 
values (and eigenvectors). 

It turns out that in practice, computing the eigenvalues and eigenvectors 
of a matrix is a much harder task than solving a linear system. Nevertheless, 
there exist numerically efficient algorithms for self-adjoint matrices or ma- 
trices having eigenvalues with distinct moduli. However, the general case is 
much more delicate to handle and will not be treated here. Finally, numerical 
stability issues are very complex. The possibility of having multiple eigenval- 
ues seriously complicates their computation, especially in determining their 
corresponding eigenvectors. 


10.2 Conditioning 


The notion of conditioning of a matrix is essential for a good understanding of rounding errors in the computation of eigenvalues, but it is different from the one introduced for solving linear systems. Consider the following “ill-conditioned” example:
$$A = \begin{pmatrix}
0 & \cdots & \cdots & 0 & \varepsilon\\
1 & 0 & \cdots & \cdots & 0\\
0 & \ddots & \ddots & & \vdots\\
\vdots & \ddots & \ddots & \ddots & \vdots\\
0 & \cdots & 0 & 1 & 0
\end{pmatrix},$$
where $\varepsilon = 10^{-n}$. Noting that $p_A(\lambda) = (-1)^n(\lambda^n - \varepsilon)$, the eigenvalues of A are the nth roots of $10^{-n}$, all of which are equal in modulus to $10^{-1}$. However, had we taken $\varepsilon = 0$, then all eigenvalues would have been zero. Therefore, for large n, small variations in the matrix entries yield large variations in the eigenvalues.
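To make the orders of magnitude concrete, the eigenvalues of this example can be written down directly as the nth roots of $\varepsilon$. The following Python sketch (the function name `perturbed_eigenvalues` is an illustrative choice) shows that a perturbation of size $10^{-40}$ in a single entry moves every eigenvalue from 0 to modulus $10^{-1}$ when $n = 40$.

```python
import cmath

def perturbed_eigenvalues(n, eps):
    """The n-th roots of eps, i.e. the roots of lambda^n - eps = 0."""
    r = eps ** (1.0 / n)
    return [r * cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

n = 40
eps = 10.0 ** (-n)                 # size of the perturbed entry
eigs = perturbed_eigenvalues(n, eps)
moduli = [abs(z) for z in eigs]    # all close to 0.1, although eps = 1e-40
```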


Definition 10.2.1. Let A be a diagonalizable matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$, and let $\|\cdot\|$ be a subordinate matrix norm. The real number defined by
$$\Gamma(A) = \inf_{P^{-1}AP = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)} \mathrm{cond}(P),$$
where $\mathrm{cond}(P) = \|P\|\,\|P^{-1}\|$ is the conditioning of the matrix P, is called the conditioning of the matrix A, relative to this norm, for computing its eigenvalues.


For all diagonalizable matrices A, we have $\Gamma(A) \ge 1$. If A is normal, then A is diagonalizable in an orthonormal basis. In other words, there exists a unitary matrix U such that $U^{-1}AU$ is diagonal. For the induced matrix norm $\|\cdot\|_2$, we know that $\|U\|_2 = \|U^{-1}\|_2 = 1$, so $\mathrm{cond}_2(U) = 1$. Accordingly, if A is normal, then $\Gamma_2(A) = 1$. In particular, the 2-norm conditioning of self-adjoint matrices for computing eigenvalues is always equal to 1.


Theorem 10.2.1 (Bauer-Fike). Let A be a diagonalizable matrix with eigenvalues $(\lambda_1, \ldots, \lambda_n)$. Let $\|\cdot\|$ be a subordinate matrix norm satisfying
$$\|\mathrm{diag}(d_1, \ldots, d_n)\| = \max_{1 \le i \le n} |d_i|.$$
Then, for every matrix perturbation $\delta A$, the eigenvalues of $(A + \delta A)$ are contained in the union of n disks of the complex plane $D_i$ defined by
$$D_i = \{z \in \mathbb{C} \;:\; |z - \lambda_i| \le \Gamma(A)\,\|\delta A\|\}, \quad 1 \le i \le n.$$

Remark 10.2.1. The hypothesis on the subordinate matrix norm is satisfied by all norms $\|\cdot\|_p$ with $p \ge 1$; see Exercise 3.3.

Proof of Theorem 10.2.1. Let P be a nonsingular matrix satisfying
$$P^{-1}AP = \mathrm{diag}(\lambda_1, \ldots, \lambda_n).$$
Let us prove that the theorem is still valid if we substitute $\Gamma(A)$ by $\mathrm{cond}(P)$, which implies the desired result when P varies. Let $\lambda$ be an eigenvalue of $(A + \delta A)$. If for some index i, we have $\lambda = \lambda_i$, then we are done. Otherwise, the matrix $\mathrm{diag}(\lambda_1 - \lambda, \ldots, \lambda_n - \lambda)$, denoted by $\mathrm{diag}(\lambda_i - \lambda)$, is nonsingular, so
$$P^{-1}(A + \delta A - \lambda I)P = \mathrm{diag}(\lambda_i - \lambda) + P^{-1}\delta A\,P \tag{10.2}$$
$$\phantom{P^{-1}(A + \delta A - \lambda I)P} = \mathrm{diag}(\lambda_i - \lambda)\,(I + B), \tag{10.3}$$
where the matrix B is defined by
$$B = \mathrm{diag}(\lambda_i - \lambda)^{-1} P^{-1}\delta A\,P.$$
Assume that $\|B\| < 1$. Then, by Proposition 3.3.1 the matrix $(I + B)$ is invertible and equality (10.3) is thereby contradicted, since the matrix $(A + \delta A - \lambda I)$ is singular. Consequently, we infer that necessarily $\|B\| \ge 1$, and
$$1 \le \|B\| \le \|\mathrm{diag}(\lambda_i - \lambda)^{-1}\|\,\|P^{-1}\|\,\|\delta A\|\,\|P\|.$$
Thanks to the hypothesis on the norm of a diagonal matrix, we deduce
$$\min_{1 \le i \le n} |\lambda_i - \lambda| \le \mathrm{cond}(P)\,\|\delta A\|,$$
which ends the proof.
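The Bauer-Fike bound can be observed on a 2 × 2 example where everything is computable by hand. The Python sketch below takes a symmetric matrix A, so that $\Gamma_2(A) = 1$, and checks that each eigenvalue of $A + \delta A$ lies within $\|\delta A\|_2$ of an eigenvalue of A. The matrices and the helper names (`eig_sym_2x2`, `eig_real_2x2`, `norm2_2x2`) are assumptions made for this illustration only.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues (ascending) of the symmetric matrix [[a, b], [b, d]]."""
    mean, half = 0.5 * (a + d), 0.5 * (a - d)
    r = math.hypot(half, b)
    return mean - r, mean + r

def eig_real_2x2(m):
    """Eigenvalues of a 2 x 2 matrix, assumed here to be real."""
    (a, b), (c, d) = m
    mean, disc = 0.5 * (a + d), 0.25 * (a - d) ** 2 + b * c
    r = math.sqrt(disc)               # raises ValueError if they are complex
    return mean - r, mean + r

def norm2_2x2(m):
    """Spectral norm: square root of the largest eigenvalue of m^t m."""
    (a, b), (c, d) = m
    g11, g12, g22 = a * a + c * c, a * b + c * d, b * b + d * d
    return math.sqrt(eig_sym_2x2(g11, g12, g22)[1])

A = [[2.0, 1.0], [1.0, 3.0]]                 # symmetric, so Gamma_2(A) = 1
dA = [[1e-3, -2e-3], [5e-4, 1e-3]]           # arbitrary small perturbation
lam = eig_sym_2x2(A[0][0], A[0][1], A[1][1])
mu = eig_real_2x2([[A[i][j] + dA[i][j] for j in range(2)] for i in range(2)])
bound = norm2_2x2(dA)                        # Gamma_2(A) * ||dA||_2
distances = [min(abs(m - l) for l in lam) for m in mu]
```

For a nonnormal matrix A the same check would require an estimate of $\Gamma(A)$, which is not attempted here.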
10.3 Power Method


The simplest method to compute eigenvalues and eigenvectors of a matrix is the power method. In practice, this method is confined to the computation of some (not all) extreme eigenvalues, provided that they are real and simple (their algebraic multiplicity is equal to 1). For the sake of simplicity we shall now restrict our attention to real matrices only. Let A be a real matrix of order n, not necessarily symmetric (unless otherwise mentioned). We call $(\lambda_1, \ldots, \lambda_n)$ its eigenvalues repeated with their algebraic multiplicity, and sorted in increasing order of their modulus $|\lambda_1| \le |\lambda_2| \le \cdots \le |\lambda_n|$. Henceforth, we denote by $\|\cdot\|$ the Euclidean norm in $\mathbb{R}^n$.

The power method for computing the largest eigenvalue $\lambda_n$ is defined by Algorithm 10.1. In the convergence test, $\varepsilon$ is a small tolerance, for instance $\varepsilon = 10^{-6}$.

    Data: A.  Output: $a \approx \lambda_n$, largest (in modulus) eigenvalue of A
                      $x_k \approx u_n$, eigenvector associated with $\lambda_n$
    Initialization:
        choose $x_0 \in \mathbb{R}^n$ such that $\|x_0\| = 1$.
    Iterations
        For $k \ge 1$, until $\|x_k - x_{k-1}\| \le \varepsilon$
            $y_k = A x_{k-1}$
            $x_k = y_k / \|y_k\|$
        End
        $a = \|y_k\|$

Algorithm 10.1: Power method.

If $\delta_k = x_k - x_{k-1}$ is small, then $x_k$ is an approximate eigenvector of A associated with the approximate eigenvalue $\|y_k\|$, since
$$A x_k - \|y_k\|\, x_k = A\delta_k.$$
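Algorithm 10.1 translates almost line by line into code. The following Python sketch uses plain lists; the iteration cap `max_iter` and the helper names are assumptions added for safety and readability, not part of the algorithm as stated.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def matvec(a, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def power_method(a, x0, eps=1e-6, max_iter=10_000):
    """Return (approximate lambda_n, approximate eigenvector, iterations)."""
    n0 = norm(x0)
    x = [xi / n0 for xi in x0]                # enforce ||x_0|| = 1
    ny, k = 0.0, 0
    for k in range(1, max_iter + 1):
        y = matvec(a, x)                      # y_k = A x_{k-1}
        ny = norm(y)
        x_new = [yi / ny for yi in y]         # x_k = y_k / ||y_k||
        delta = norm([p - q for p, q in zip(x_new, x)])
        x = x_new
        if delta <= eps:                      # convergence test
            break
    return ny, x, k

A = [[2.0, 1.0], [1.0, 3.0]]
value, vector, iterations = power_method(A, [1.0, 0.0], eps=1e-10)
```

For this matrix the largest eigenvalue is $(5 + \sqrt{5})/2$, and it is positive, so the iterates do not alternate in sign and the test on $\|x_k - x_{k-1}\|$ is meaningful.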


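Theorem 10.3.1. Assume that A is diagonalizable, with real eigenvalues $(\lambda_1, \ldots, \lambda_n)$ associated to real eigenvectors $(e_1, \ldots, e_n)$, and that the eigenvalue with the largest modulus, denoted by $\lambda_n$, is simple and positive, i.e.,
$$|\lambda_1| \le \cdots \le |\lambda_{n-1}| < \lambda_n.$$
Assume also that in the eigenvectors' basis, the initial vector reads $x_0 = \sum_{i=1}^n \beta_i e_i$ with $\beta_n \ne 0$. Then, the power method converges, that is,
$$\lim_{k \to +\infty} \|y_k\| = \lambda_n, \quad \lim_{k \to +\infty} x_k = x_\infty, \quad \text{where } x_\infty = \pm e_n.$$
Moreover, the convergence speed is proportional to the ratio $|\lambda_{n-1}|/|\lambda_n|$:
$$\big|\, \|y_k\| - \lambda_n \big| \le C \left|\frac{\lambda_{n-1}}{\lambda_n}\right|^k \quad \text{and} \quad \|x_k - x_\infty\| \le C \left|\frac{\lambda_{n-1}}{\lambda_n}\right|^k.$$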
Remark 10.3.1. The assumption that the largest eigenvalue is real is crucial. However, if it is negative instead, we can apply Theorem 10.3.1 to $-A$. In practice, if we apply the above algorithm to a matrix whose largest-modulus eigenvalue is negative, it is the sequence $(-1)^k x_k$ that converges, while $-\|y_k\|$ converges to this eigenvalue. Let us recall that the simplicity condition on the eigenvalue $\lambda_n$ expresses the fact that the associated generalized eigenspace is of dimension 1, or that $\lambda_n$ is a simple root of the characteristic polynomial. Should there be several eigenvalues of maximal modulus, the sequences $\|y_k\|$ and $x_k$ will not converge in general. Nevertheless, if the eigenvalue $\lambda_n$ is multiple but is the only one with maximum modulus, then the sequence $\|y_k\|$ always converges to $\lambda_n$ (but the sequence $x_k$ may not converge).


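Proof of Theorem 10.3.1. The vector $x_k$ is proportional to $A^k x_0$, and because of the form of the initial vector $x_0$, it is also proportional to
$$\sum_{i=1}^n \beta_i \lambda_i^k e_i,$$
which implies
$$x_k = \frac{\beta_n e_n + \sum_{i=1}^{n-1} \beta_i \left(\frac{\lambda_i}{\lambda_n}\right)^k e_i}{\left\|\beta_n e_n + \sum_{i=1}^{n-1} \beta_i \left(\frac{\lambda_i}{\lambda_n}\right)^k e_i\right\|}.$$
Note in passing that since $\beta_n \ne 0$, the norm $\|y_k\|$ never vanishes and there is no division by zero in Algorithm 10.1. From the assumption $|\lambda_i| < \lambda_n$ for $i \ne n$, we deduce that $x_k$ converges to $\beta_n e_n / \|\beta_n e_n\|$. Likewise, we have
$$\|y_{k+1}\| = \lambda_n \frac{\left\|\beta_n e_n + \sum_{i=1}^{n-1} \beta_i \left(\frac{\lambda_i}{\lambda_n}\right)^{k+1} e_i\right\|}{\left\|\beta_n e_n + \sum_{i=1}^{n-1} \beta_i \left(\frac{\lambda_i}{\lambda_n}\right)^k e_i\right\|},$$
which therefore converges to $\lambda_n$.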


It is possible to compute the smallest eigenvalue (in modulus) of A by applying the power method to $A^{-1}$: it is then called the inverse power method; see Algorithm 10.2 below. The computational cost of the inverse power method is higher, compared to the power method, because a linear system has to be solved at each iteration. If $\delta_k = x_k - x_{k-1}$ is small, then $x_{k-1}$ is an approximate eigenvector associated to the approximate eigenvalue $1/\|y_k\|$, since
$$A x_{k-1} - \frac{x_{k-1}}{\|y_k\|} = -A\delta_k.$$

    Data: A.  Output: $a \approx \lambda_1$, smallest (in modulus) eigenvalue of A
                      $x_k \approx u_1$, eigenvector associated with $\lambda_1$
    Initialization:
        choose $x_0 \in \mathbb{R}^n$ such that $\|x_0\| = 1$.
    Iterations
        For $k \ge 1$, until $\|x_k - x_{k-1}\| \le \varepsilon$
            solve $A y_k = x_{k-1}$
            $x_k = y_k / \|y_k\|$
        End
        $a = 1/\|y_k\|$

Algorithm 10.2: Inverse power method.


Theorem 10.3.2. Assume that A is nonsingular and diagonalizable, with real eigenvalues $(\lambda_1, \ldots, \lambda_n)$ associated to real eigenvectors $(e_1, \ldots, e_n)$, and that the eigenvalue with the smallest modulus, denoted by $\lambda_1$, is simple and positive, i.e.,
$$0 < \lambda_1 < |\lambda_2| \le \cdots \le |\lambda_n|.$$
Assume also that in the basis of eigenvectors, the initial vector reads $x_0 = \sum_{i=1}^n \beta_i e_i$ with $\beta_1 \ne 0$. Then the inverse power method converges, that is,
$$\lim_{k \to +\infty} \frac{1}{\|y_k\|} = \lambda_1, \quad \lim_{k \to +\infty} x_k = x_\infty \quad \text{such that } x_\infty = \pm e_1.$$
Moreover, the convergence speed is proportional to the ratio $|\lambda_1|/|\lambda_2|$:
$$\big|\, \|y_k\|^{-1} - \lambda_1 \big| \le C \left|\frac{\lambda_1}{\lambda_2}\right|^k \quad \text{and} \quad \|x_k - x_\infty\| \le C \left|\frac{\lambda_1}{\lambda_2}\right|^k.$$

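Proof. The vector $x_k$ is proportional to $A^{-k} x_0$ and thus to
$$\sum_{i=1}^n \beta_i \lambda_i^{-k} e_i,$$
which implies
$$x_k = \frac{\beta_1 e_1 + \sum_{i=2}^{n} \beta_i \left(\frac{\lambda_1}{\lambda_i}\right)^k e_i}{\left\|\beta_1 e_1 + \sum_{i=2}^{n} \beta_i \left(\frac{\lambda_1}{\lambda_i}\right)^k e_i\right\|}.$$
Note again that the norm $\|y_k\|$ never vanishes, so there is no division by zero in Algorithm 10.2. From $\lambda_1 < |\lambda_i|$ for $i \ne 1$, we deduce that $x_k$ converges to $\beta_1 e_1 / \|\beta_1 e_1\|$ and that $\|y_k\|^{-1}$ converges to $\lambda_1$.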


When the matrix A is real symmetric, the statements of the previous theorems simplify a little, since A is automatically diagonalizable and the assumption on the initial vector $x_0$ simply requests that $x_0$ not be orthogonal to the eigenvector $e_n$ (or $e_1$). Furthermore, the convergence toward the eigenvalue is faster in the real symmetric case. We only summarize, hereinafter, the power method; a similar result is also valid for the inverse power method.
Theorem 10.3.3. Assume that the matrix A is real symmetric, with eigenvalues $(\lambda_1, \ldots, \lambda_n)$ associated to an orthonormal basis of eigenvectors $(e_1, \ldots, e_n)$, and that the eigenvalue with the largest modulus $\lambda_n$ is simple and positive, i.e.,
$$|\lambda_1| \le \cdots \le |\lambda_{n-1}| < \lambda_n.$$
Assume also that the initial vector $x_0$ is not orthogonal to $e_n$. Then the power method converges, and the sequence $\|y_k\|$ converges quadratically to the eigenvalue $\lambda_n$, that is,
$$\big|\, \|y_k\| - \lambda_n \big| \le C \left|\frac{\lambda_{n-1}}{\lambda_n}\right|^{2k}.$$

Remark 10.3.2. The convergence speed of the approximate eigenvector $x_k$ is unchanged compared to Theorem 10.3.1. However, the convergence of the approximate eigenvalue $\|y_k\|$ is faster, since $\left(\frac{|\lambda_{n-1}|}{|\lambda_n|}\right)^2 < \frac{|\lambda_{n-1}|}{|\lambda_n|}$.


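Proof of Theorem 10.3.3. In the orthonormal basis $(e_i)_{1 \le i \le n}$ we write the initial vector in the form $x_0 = \sum_{i=1}^n \beta_i e_i$. Thus, $x_k$ reads
$$x_k = \frac{\beta_n e_n + \sum_{i=1}^{n-1} \beta_i \left(\frac{\lambda_i}{\lambda_n}\right)^k e_i}{\left(\beta_n^2 + \sum_{i=1}^{n-1} \beta_i^2 \left(\frac{\lambda_i}{\lambda_n}\right)^{2k}\right)^{1/2}}.$$
Then,
$$\|y_{k+1}\| = \|A x_k\| = \lambda_n \frac{\left(\beta_n^2 + \sum_{i=1}^{n-1} \beta_i^2 \left(\frac{\lambda_i}{\lambda_n}\right)^{2(k+1)}\right)^{1/2}}{\left(\beta_n^2 + \sum_{i=1}^{n-1} \beta_i^2 \left(\frac{\lambda_i}{\lambda_n}\right)^{2k}\right)^{1/2}},$$
which yields the desired result, since both the numerator and the denominator differ from $|\beta_n|$ by a term of order $|\lambda_{n-1}/\lambda_n|^{2k}$.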


Remark 10.3.3. To compute other eigenvalues of a real symmetric matrix (that is, not the smallest, nor the largest), we can use the so-called deflation method. For instance, if we are interested in the second-largest-modulus eigenvalue, namely $\lambda_{n-1}$, the deflation method amounts to the following process. We first compute the largest-modulus eigenvalue $\lambda_n$ and an associated unit eigenvector $e_n$ such that $Ae_n = \lambda_n e_n$ with $\|e_n\| = 1$. Then, we apply again the power method to A starting with an initial vector $x_0$ orthogonal to $e_n$. This is equivalent to computing the largest eigenvalue of A, restricted to the subspace orthogonal to $e_n$ (which is stable by A), that is, to evaluating $\lambda_{n-1}$. In practice, at each iteration the vector $x_k$ is orthogonalized against $e_n$ to be sure that it belongs to the subspace orthogonal to $e_n$. A similar idea works for computing the second-smallest-modulus eigenvalue $\lambda_2$ in the framework of the inverse power method. In practice, this technique is suitable only for the computation of some extreme eigenvalues of A (because of numerical loss of orthogonality). It is not recommended if all or only some intermediate eigenvalues have to be computed.
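The deflation process described in Remark 10.3.3 can be sketched as follows in Python. The optional argument `e`, the iteration cap `max_iter`, and the test matrix are assumptions made for this illustration; the matrix is symmetric with eigenvalues $2 - \sqrt{2}$, $2$, and $2 + \sqrt{2}$.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(a, v):
    return [dot(row, v) for row in a]

def normalize(v):
    nv = math.sqrt(dot(v, v))
    return [vi / nv for vi in v]

def remove_component(v, e):
    """Orthogonalize v against the unit vector e."""
    c = dot(v, e)
    return [vi - c * ei for vi, ei in zip(v, e)]

def power_method(a, x0, e=None, eps=1e-10, max_iter=10_000):
    """Power method; if a unit vector e is given, every iterate is
    orthogonalized against it (deflation)."""
    x = normalize(remove_component(x0, e) if e is not None else x0)
    value = 0.0
    for _ in range(max_iter):
        y = matvec(a, x)
        if e is not None:
            y = remove_component(y, e)        # stay in the orthogonal of e
        value = math.sqrt(dot(y, y))          # ||y_k||
        x_new = [yi / value for yi in y]
        delta = math.sqrt(sum((p - q) ** 2 for p, q in zip(x_new, x)))
        x = x_new
        if delta <= eps:
            break
    return value, x

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
lam_n, e_n = power_method(A, [1.0, 0.0, 0.0])              # largest eigenvalue
lam_nm1, e_nm1 = power_method(A, [1.0, 0.0, 0.0], e=e_n)   # second largest
```

Re-orthogonalizing at every step costs one extra scalar product per iteration, but it prevents rounding errors from reintroducing a component along $e_n$.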


Let us conclude this section by mentioning that the convergence of the inverse 
power method can be accelerated by various means (see the next remark). 


Remark 10.3.4. Let $\mu \notin \sigma(A)$ be a rough approximation of an eigenvalue $\lambda$ of A. Since the eigenvectors of $(A - \mu I_n)$ are those of A and its spectrum is just shifted by $\mu$, we can apply the inverse power method to the nonsingular matrix $(A - \mu I_n)$ in order to compute a better approximation of $\lambda$ and a corresponding eigenvector. Note that the rate of convergence is improved, since if $\mu$ was a “not too bad” approximation of $\lambda$, the smallest-modulus eigenvalue of $(A - \mu I_n)$ is $(\lambda - \mu)$, which is small.
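A minimal Python sketch of the shifted inverse power method of Remark 10.3.4 follows. Several choices here are assumptions of this example rather than part of the text: the linear systems are solved by a small Gaussian elimination with partial pivoting, the eigenvalue is recovered with its sign from the scalar product $\langle x_{k-1}, y_k\rangle \approx 1/(\lambda - \mu)$, and the sign of the iterate is realigned at each step so that the stopping test also works when $\lambda - \mu < 0$.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def solve(a, b):
    """Solve a y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]      # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        s = sum(m[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (m[r][n] - s) / m[r][r]
    return y

def shifted_inverse_power(a, mu, x0, eps=1e-10, max_iter=1000):
    """Inverse power method applied to (A - mu I); returns (lambda, x)."""
    n = len(a)
    shifted = [[a[i][j] - (mu if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    nx = math.sqrt(dot(x0, x0))
    x = [xi / nx for xi in x0]
    lam = mu
    for _ in range(max_iter):
        y = solve(shifted, x)                 # (A - mu I) y_k = x_{k-1}
        lam = mu + 1.0 / dot(x, y)            # signed estimate of lambda
        ny = math.sqrt(dot(y, y))
        x_new = [yi / ny for yi in y]
        if dot(x_new, x) < 0.0:               # undo the sign flip if lambda < mu
            x_new = [-xi for xi in x_new]
        delta = math.sqrt(sum((p - q) ** 2 for p, q in zip(x_new, x)))
        x = x_new
        if delta <= eps:
            break
    return lam, x

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
lam, vec = shifted_inverse_power(A, 1.9, [1.0, 0.0, 0.0])   # targets lambda = 2
```

In a serious implementation the matrix $(A - \mu I_n)$ would be factorized once and the factorization reused at every iteration; the sketch re-solves from scratch for brevity.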


10.4 Jacobi Method 


In this section we restrict our attention to the case of real symmetric matrices. 
The purpose of the Jacobi method is to compute all eigenvalues of a matrix 
A. Its principle is to iteratively multiply A by elementary rotation matrices, 
known as Givens matrices. 


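Definition 10.4.1. Let $\theta \in \mathbb{R}$ be an angle. Let $p \ne q$ be two different integers between 1 and n. A Givens matrix is a rotation matrix $Q(p, q, \theta)$ defined by its entries $Q_{i,j}(p, q, \theta)$:
$$\begin{aligned}
Q_{p,p}(p, q, \theta) &= \cos\theta,\\
Q_{q,q}(p, q, \theta) &= \cos\theta,\\
Q_{p,q}(p, q, \theta) &= \sin\theta,\\
Q_{q,p}(p, q, \theta) &= -\sin\theta,\\
Q_{i,j}(p, q, \theta) &= \delta_{i,j} \quad \text{in all other cases.}
\end{aligned}$$

The Givens matrix $Q(p, q, \theta)$ corresponds to a rotation of angle $\theta$ in the plane span $\{e_p, e_q\}$. It has the following shape:
$$Q(p, q, \theta) = \begin{pmatrix}
1 & & & & & & 0\\
 & \ddots & & & & & \\
 & & \cos\theta & & \sin\theta & & \\
 & & & \ddots & & & \\
 & & -\sin\theta & & \cos\theta & & \\
 & & & & & \ddots & \\
0 & & & & & & 1
\end{pmatrix},$$
where the four trigonometric entries lie at the intersections of rows $p$, $q$ and columns $p$, $q$.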
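The definition can be turned into a few lines of code. The Python sketch below builds $Q(p, q, \theta)$ with the 1-based indices of the text and checks that $Q^tQ$ is the identity; `givens` and `gram` are illustrative names.

```python
import math

def givens(n, p, q, theta):
    """Givens matrix Q(p, q, theta) of order n (p and q are 1-based, p != q)."""
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    m[p - 1][p - 1] = c
    m[q - 1][q - 1] = c
    m[p - 1][q - 1] = s
    m[q - 1][p - 1] = -s
    return m

def gram(m):
    """Return m^t m."""
    n = len(m)
    return [[sum(m[k][i] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Q = givens(4, 2, 4, 0.3)
G = gram(Q)        # should be the identity up to rounding errors
```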
Lemma 10.4.1. The Givens rotation matrix $Q(p, q, \theta)$ is an orthogonal matrix that satisfies, for any real symmetric matrix A,
$$\sum_{1 \le i,j \le n} b_{i,j}^2 = \sum_{1 \le i,j \le n} a_{i,j}^2,$$
where $B = Q^t(p, q, \theta)\, A\, Q(p, q, \theta)$. Moreover, if $a_{p,q} \ne 0$, there exists a unique angle $\theta \in \,]-\frac{\pi}{4}; 0[\, \cup \,]0; \frac{\pi}{4}]$, defined by $\cot(2\theta) = \frac{a_{q,q} - a_{p,p}}{2a_{p,q}}$, such that
$$b_{p,q} = 0 \quad \text{and} \quad \sum_{i=1}^n b_{i,i}^2 = \sum_{i=1}^n a_{i,i}^2 + 2a_{p,q}^2. \tag{10.4}$$
i=l {=l 


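Proof. Since $Q(p, q, \theta)$ is a rotation matrix, it is orthogonal, i.e., $QQ^t = I_n$. A simple computation gives
$$\sum_{i,j=1}^n b_{i,j}^2 = \|B\|_F^2 = \mathrm{tr}(B^tB) = \mathrm{tr}\left(Q^t(p, q, \theta)\, A^tA\, Q(p, q, \theta)\right),$$
where $\|\cdot\|_F$ denotes the Frobenius norm. Since $\mathrm{tr}(MN) = \mathrm{tr}(NM)$, we deduce
$$\mathrm{tr}\left(Q^t(p, q, \theta)\, A^tA\, Q(p, q, \theta)\right) = \mathrm{tr}(A^tA),$$
and thus
$$\sum_{i,j=1}^n b_{i,j}^2 = \mathrm{tr}(A^tA) = \sum_{i,j=1}^n a_{i,j}^2.$$
We note that for any angle $\theta$, only rows and columns of indices p and q of B change with respect to those of A. In addition, we have
$$\begin{pmatrix} b_{p,p} & b_{p,q}\\ b_{p,q} & b_{q,q} \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} a_{p,p} & a_{p,q}\\ a_{p,q} & a_{q,q} \end{pmatrix} \begin{pmatrix} \cos\theta & \sin\theta\\ -\sin\theta & \cos\theta \end{pmatrix}, \tag{10.5}$$
which leads to
$$\begin{aligned}
b_{p,p} &= -2a_{p,q}\sin\theta\cos\theta + a_{p,p}\cos^2\theta + a_{q,q}\sin^2\theta,\\
b_{p,q} &= a_{p,q}(\cos^2\theta - \sin^2\theta) + (a_{p,p} - a_{q,q})\sin\theta\cos\theta,\\
b_{q,q} &= 2a_{p,q}\sin\theta\cos\theta + a_{p,p}\sin^2\theta + a_{q,q}\cos^2\theta.
\end{aligned}$$
Consequently, $b_{p,q} = 0$ if and only if the angle $\theta$ satisfies
$$a_{p,q}\cos 2\theta + \frac{a_{p,p} - a_{q,q}}{2}\sin 2\theta = 0.$$
Such a choice of $\theta$ is always possible because $\cot 2\theta$ is onto on $\mathbb{R}$. For this precise value of $\theta$, we deduce from (10.5) that $b_{p,q} = 0$ and
$$b_{p,p}^2 + b_{q,q}^2 = a_{p,p}^2 + a_{q,q}^2 + 2a_{p,q}^2.$$
Since all other diagonal entries of B are identical to those of A, this proves (10.4).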
Definition 10.4.2. The Jacobi method for computing the eigenvalues of a real symmetric matrix A amounts to building a sequence of matrices $A_k = (a^k_{i,j})_{1 \le i,j \le n}$ defined by
$$A_1 = A, \qquad A_{k+1} = Q^t(p_k, q_k, \theta_k)\, A_k\, Q(p_k, q_k, \theta_k),$$
where $Q(p_k, q_k, \theta_k)$ is a Givens rotation matrix with the following choice:

(i) $(p_k, q_k)$ is a pair of indices such that $|a^k_{p_k,q_k}| = \max_{i \ne j} |a^k_{i,j}|$,

(ii) $\theta_k$ is the angle such that $a^{k+1}_{p_k,q_k} = 0$.


Remark 10.4.1. During the iterations, the previously zeroed off-diagonal entries do not remain so. In other words, although $a^{k+1}_{p_k,q_k} = 0$, for further iterations $l \ge k + 2$ we have, in general, $a^{l}_{p_k,q_k} \ne 0$. Thus, we do not obtain a diagonal matrix after a finite number of iterations. However, (10.4) in Lemma 10.4.1 proves that the sum of all squared off-diagonal entries of $A_k$ decreases as k increases.


Remark 10.4.2. From a numerical viewpoint, the angle $\theta_k$ is never computed explicitly. Indeed, trigonometric functions are computationally expensive, while only the cosine and sine of $\theta_k$ are required to compute $A_{k+1}$. Since we have an explicit formula for $\cot 2\theta_k$ and $2\cot 2\theta_k = 1/\tan\theta_k - \tan\theta_k$, we deduce the value of $\tan\theta_k$ by computing the roots of a second-degree polynomial. Next, we obtain the values of $\cos\theta_k$ and $\sin\theta_k$ by computing again the roots of another second-degree polynomial.
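The computation outlined in Remark 10.4.2 can be sketched as follows. With $x = \cot 2\theta$, the identity $2\cot 2\theta = 1/\tan\theta - \tan\theta$ shows that $t = \tan\theta$ is a root of $t^2 + 2xt - 1 = 0$; the root with $|t| \le 1$ corresponds to $\theta \in \,]-\pi/4; \pi/4]$. The function name `jacobi_rotation` and the way the root is written (chosen to avoid cancellation) are assumptions of this Python example.

```python
import math

def jacobi_rotation(app, aqq, apq):
    """Return (cos(theta), sin(theta)) such that b_pq = 0, without calling
    any trigonometric function. Requires apq != 0."""
    x = (aqq - app) / (2.0 * apq)             # x = cot(2 theta)
    # tan(theta) is the root of t^2 + 2 x t - 1 = 0 with |t| <= 1
    if x >= 0.0:
        t = 1.0 / (x + math.sqrt(x * x + 1.0))
    else:
        t = -1.0 / (-x + math.sqrt(x * x + 1.0))
    c = 1.0 / math.sqrt(1.0 + t * t)          # cos^2 = 1 / (1 + tan^2)
    return c, t * c

app, aqq, apq = 2.0, 3.0, 1.0
c, s = jacobi_rotation(app, aqq, apq)
# Entries of B = Q^t A Q restricted to rows and columns p, q, as in (10.5).
bpq = apq * (c * c - s * s) + (app - aqq) * s * c
bpp = -2.0 * apq * s * c + app * c * c + aqq * s * s
bqq = 2.0 * apq * s * c + app * s * s + aqq * c * c
```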


Theorem 10.4.1. Assume that the matrix A is real symmetric with eigenvalues $(\lambda_1, \ldots, \lambda_n)$. The sequence of matrices $A_k$ of the Jacobi method converges, and we have
$$\lim_{k \to +\infty} A_k = \mathrm{diag}(\lambda_{\sigma(i)}),$$
where $\sigma$ is a permutation of $\{1, 2, \ldots, n\}$.


Theorem 10.4.2. If, in addition, all eigenvalues of A are distinct, then the sequence of orthogonal matrices $Q_k$, defined by
$$Q_k = Q(p_1, q_1, \theta_1)\, Q(p_2, q_2, \theta_2) \cdots Q(p_k, q_k, \theta_k),$$
converges to an orthogonal matrix whose column vectors are the eigenvectors of A arranged in the same order as the eigenvalues $\lambda_{\sigma(i)}$.
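The Jacobi method of Definition 10.4.2 can be sketched in a few lines of Python. The stopping tolerance `tol`, the cap `max_rot` on the number of rotations, and the test matrix are assumptions of this example; the rotation is applied in place to the columns and then to the rows of indices $p$ and $q$, which is equivalent to forming $Q^tA_kQ$.

```python
import math

def jacobi_eigenvalues(a, tol=1e-12, max_rot=10_000):
    """Eigenvalues (sorted) of a real symmetric matrix by the Jacobi method."""
    a = [row[:] for row in a]                 # work on a copy
    n = len(a)
    for _ in range(max_rot):
        # (i) largest off-diagonal entry in modulus
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[p][q]) <= tol:
            break
        # (ii) rotation that zeroes the entry (p, q); x = cot(2 theta)
        x = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
        if x >= 0.0:
            t = 1.0 / (x + math.sqrt(x * x + 1.0))
        else:
            t = -1.0 / (-x + math.sqrt(x * x + 1.0))
        c = 1.0 / math.sqrt(1.0 + t * t)
        s = t * c
        for i in range(n):                    # columns p, q of A Q
            aip, aiq = a[i][p], a[i][q]
            a[i][p] = c * aip - s * aiq
            a[i][q] = s * aip + c * aiq
        for j in range(n):                    # rows p, q of Q^t (A Q)
            apj, aqj = a[p][j], a[q][j]
            a[p][j] = c * apj - s * aqj
            a[q][j] = s * apj + c * aqj
    return sorted(a[i][i] for i in range(n))

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
eigenvalues = jacobi_eigenvalues(A)           # 2 - sqrt(2), 2, 2 + sqrt(2)
```

Searching for the largest off-diagonal entry costs of the order of $n^2$ comparisons per rotation; cyclic sweeps over all pairs $(p, q)$ are a common alternative, not discussed in this section.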


To prove these theorems, we need a technical lemma. 


Lemma 10.4.2. Consider a sequence of matrices $M_k$ such that

(i) $\lim_{k \to +\infty} \|M_{k+1} - M_k\| = 0$;
(ii) there exists C, independent of k, such that $\|M_k\| \le C$ for all $k \ge 1$;
(iii) the sequence $M_k$ has a finite number of cluster points (limit points).

Then the sequence $M_k$ converges.


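Proof. Let $A_1, \ldots, A_p$ be all the cluster points of the sequence $M_k$. For every $A_i$, $1 \le i \le p$, there exists a subsequence $M_{k(i)}$ that converges to $A_i$. Let us show that for all $\varepsilon > 0$, there exists an integer $k(\varepsilon)$ such that
$$M_k \in \bigcup_{i=1}^p B(A_i, \varepsilon), \quad \forall k \ge k(\varepsilon),$$
where $B(A_i, \varepsilon)$ is the closed ball of center $A_i$ and radius $\varepsilon$. If this were not true, there would exist $\varepsilon_0 > 0$ and an infinite subsequence $k'$ such that
$$\|M_{k'} - A_i\| > \varepsilon_0, \quad 1 \le i \le p.$$
This new subsequence $M_{k'}$ has no $A_i$ as a cluster point. However, it is bounded in a finite-dimensional space, so it must have at least one other cluster point, which contradicts the fact that there is no other cluster point of the sequence $M_k$ but the $A_i$. Thus, for $\varepsilon = \frac{1}{4}\min_{i \ne i'} \|A_i - A_{i'}\|$, there exists $k(\varepsilon)$ such that
$$\|M_{k+1} - M_k\| \le \varepsilon, \quad \text{and} \quad M_k \in \bigcup_{i=1}^p B(A_i, \varepsilon), \quad \forall k \ge k(\varepsilon).$$
In particular, there exists $i_0$ such that $M_{k(\varepsilon)} \in B(A_{i_0}, \varepsilon)$. Let us show that $M_k \in B(A_{i_0}, \varepsilon)$ for all $k \ge k(\varepsilon)$. Let $k_1$ be the largest integer such that $M_k$ belongs to $B(A_{i_0}, \varepsilon)$ for all $k(\varepsilon) \le k \le k_1$. If $k_1$ were finite, it would satisfy
$$\|M_{k_1} - A_{i_0}\| \le \varepsilon \quad \text{and} \quad \|M_{k_1+1} - A_{i_0}\| > \varepsilon.$$
Accordingly, there exists $i_1 \ne i_0$ such that $\|M_{k_1+1} - A_{i_1}\| \le \varepsilon$. Therefore we deduce
$$\|A_{i_0} - A_{i_1}\| \le \|M_{k_1} - A_{i_0}\| + \|M_{k_1+1} - M_{k_1}\| + \|M_{k_1+1} - A_{i_1}\| \le 3\varepsilon \le \frac{3}{4}\min_{i \ne i'} \|A_i - A_{i'}\|,$$
which is not possible because $\min_{i \ne i'} \|A_i - A_{i'}\| > 0$. Therefore $k_1 = +\infty$. Consequently, every cluster point of $M_k$ lies in $B(A_{i_0}, \varepsilon)$, which forces $p = 1$, and a bounded sequence with a single cluster point converges.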
Proof of Theorem 10.4.1. We split the matrix $A_k = (a^k_{i,j})_{1 \le i,j \le n}$ as follows:
$$A_k = D_k + B_k \quad \text{with} \quad D_k = \mathrm{diag}(a^k_{i,i}).$$
Let $\varepsilon_k = \|B_k\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm. It satisfies
$$\varepsilon_k = \|A_k\|_F^2 - \|D_k\|_F^2.$$
From $\|A_k\|_F = \|A_{k+1}\|_F$ and $\|D_{k+1}\|_F^2 = \|D_k\|_F^2 + 2|a^k_{p_k,q_k}|^2$ (by Lemma 10.4.1), we deduce that
$$\varepsilon_{k+1} = \varepsilon_k - 2|a^k_{p_k,q_k}|^2.$$
Since $|a^k_{p_k,q_k}| = \max_{i \ne j} |a^k_{i,j}|$, we have
$$\varepsilon_k \le (n^2 - n)\,|a^k_{p_k,q_k}|^2,$$
and so
$$\varepsilon_{k+1} \le \left(1 - \frac{2}{n^2 - n}\right)\varepsilon_k.$$
As a result, the sequence $\varepsilon_k$ tends to 0, and
$$\lim_{k \to +\infty} B_k = 0.$$
Since the off-diagonal part of $A_k$ tends to 0, it remains to prove the convergence of its diagonal part $D_k$. For this purpose we check that $D_k$ satisfies the assumptions of Lemma 10.4.2. The sequence $D_k$ is bounded, because $\|D_k\|_F \le \|A_k\|_F = \|A\|_F$. Let us show that $D_k$ has a finite number of cluster points. Since $D_k$ is a bounded sequence in a normed vector space of finite dimension, there exists a subsequence $D_{k'}$ that converges to a limit D. Hence, $A_{k'}$ converges to D. As a consequence,
$$\det(D - \lambda I) = \lim_{k' \to +\infty} \det(A_{k'} - \lambda I) = \lim_{k' \to +\infty} \det\left(Q_{k'}^t (A - \lambda I) Q_{k'}\right) = \det(A - \lambda I),$$
where $Q_k$ is the orthogonal matrix defined in Theorem 10.4.2. Since D is diagonal and has the same eigenvalues as A, D necessarily coincides with $\mathrm{diag}(\lambda_{\sigma(i)})$ for some permutation $\sigma$. There exist $n!$ such permutations, therefore $D_k$ has at most $n!$ cluster points. Finally, let us prove that $(D_{k+1} - D_k)$ converges to 0. By definition, $d^{k+1}_{i,i} = a^{k+1}_{i,i} = a^k_{i,i} = d^k_{i,i}$ if $i \ne p$ and $i \ne q$. It remains to compute $d^{k+1}_{p,p} - d^k_{p,p} = a^{k+1}_{p,p} - a^k_{p,p}$ (the case $a^{k+1}_{q,q} - a^k_{q,q}$ is symmetric):
$$a^{k+1}_{p,p} - a^k_{p,p} = -2a^k_{p,q}\sin\theta_k\cos\theta_k + (a^k_{q,q} - a^k_{p,p})\sin^2\theta_k.$$
By the formula $\cos 2\theta_k = 1 - 2\sin^2\theta_k = \frac{a^k_{q,q} - a^k_{p,p}}{2a^k_{p,q}}\sin 2\theta_k$, we obtain
$$a^k_{q,q} - a^k_{p,p} = a^k_{p,q}\,\frac{1 - 2\sin^2\theta_k}{\sin\theta_k\cos\theta_k},$$
from which we deduce that
$$a^{k+1}_{p,p} - a^k_{p,p} = -2a^k_{p,q}\sin\theta_k\cos\theta_k + a^k_{p,q}\,\frac{\sin\theta_k}{\cos\theta_k}\,(1 - 2\sin^2\theta_k) = -a^k_{p,q}\tan\theta_k.$$
Since $\theta_k \in [-\frac{\pi}{4}; \frac{\pi}{4}]$, we have $|\tan\theta_k| \le 1$. Thus
$$\|D_{k+1} - D_k\|_F^2 \le 2|a^k_{p,q}|^2 \le \|B_k\|_F^2 = \varepsilon_k,$$
which tends to 0 as k goes to infinity. Then we conclude by applying Lemma 10.4.2.


Proof of Theorem 10.4.2. We apply Lemma 10.4.2 to the sequence $Q_k$, which is bounded because $Q_k$ is orthogonal, so that $\|Q_k\|_F = \sqrt{n}$. Let us show that $Q_k$ has a finite number of cluster points. Since it is bounded, there exists a subsequence $Q_{k'}$ that converges to a limit Q, and we have
$$\lim_{k' \to +\infty} A_{k'+1} = \mathrm{diag}(\lambda_{\sigma(i)}) = \lim_{k' \to +\infty} Q_{k'}^t A\, Q_{k'} = Q^t A\, Q,$$
which implies that the columns of Q are equal to $(\pm f_{\sigma(i)})$, where $f_i$ is the normalized eigenvector corresponding to the eigenvalue $\lambda_i$ of A (these eigenvectors are unique, up to a change of sign, because all eigenvalues are simple by assumption). Since there is a finite number of permutations and possible changes of sign, the cluster points, like Q, are finite in number. Finally, let us prove that $(Q_{k+1} - Q_k)$ converges to 0. We write
$$Q_{k+1} - Q_k = Q_k\left(Q(p_{k+1}, q_{k+1}, \theta_{k+1}) - I\right).$$
By definition, $\tan 2\theta_k = 2a^k_{p_k,q_k}/(a^k_{q_k,q_k} - a^k_{p_k,p_k})$, and $A_k$ converges to $\mathrm{diag}(\lambda_{\sigma(i)})$ with distinct eigenvalues. Consequently, for k large enough,
$$|a^k_{q_k,q_k} - a^k_{p_k,p_k}| \ge \frac{1}{2}\min_{i \ne j} |\lambda_i - \lambda_j| > 0.$$
Since $\lim_{k \to +\infty} a^k_{p_k,q_k} = 0$, we deduce that $\theta_k$ tends to 0, and so $(Q_{k+1} - Q_k)$ converges to 0. Applying Lemma 10.4.2 finishes the proof.


10.5 Givens—Householder Method 


Once again we restrict ourselves to the case of real symmetric matrices. The 
main idea of the Givens-Householder method is to reduce a matrix to its 
tridiagonal form, the eigenvalues of which are easier to compute. 


Definition 10.5.1. The Givens-Householder method is decomposed into two successive steps:

1. By the Householder method, a symmetric matrix A is reduced to a tridiagonal matrix, that is, an orthogonal matrix Q is built such that $Q^tAQ$ is tridiagonal (this first step is executed in a finite number of operations).

2. The eigenvalues of this tridiagonal matrix are computed by a bisection (or dichotomy) method proposed by Givens (this second step is an iterative method).

We begin with the first step, namely, the Householder method.


Proposition 10.5.1. Let A be a real symmetric matrix. There exist $(n - 2)$ orthogonal matrices $H_k$ such that
$$T = (H_1 H_2 \cdots H_{n-2})^t\, A\, (H_1 H_2 \cdots H_{n-2})$$
is tridiagonal. Note that A and T have the same eigenvalues.


Proof. Starting with A, we build a sequence of matrices $(A_k)_{1 \le k \le n-1}$ such that $A_1 = A$, and $A_{k+1} = H_k^t A_k H_k$, where $H_k$ is an orthogonal matrix chosen in such a way that $A_k$ has the following block structure:
$$A_k = \begin{pmatrix} T_k & E_k^t\\ E_k & M_k \end{pmatrix}.$$
In the above, $T_k$ is a tridiagonal square matrix of size k, $M_k$ is a square matrix of size $n - k$, and $E_k$ is a rectangular matrix with $(n - k)$ rows and k columns whose last column only, denoted by $a_k \in \mathbb{R}^{n-k}$, is nonzero:
$$T_k = \begin{pmatrix} \times & \times & & 0\\ \times & \ddots & \ddots & \\ & \ddots & \ddots & \times\\ 0 & & \times & \times \end{pmatrix} \quad \text{and} \quad E_k = \begin{pmatrix} 0 & \cdots & 0 & a_{k,1}\\ \vdots & & \vdots & \vdots\\ 0 & \cdots & 0 & a_{k,n-k} \end{pmatrix}.$$
Thus, it is clear that $A_{n-1}$ is a tridiagonal matrix. We note that A is indeed of this form for k = 1. Let $H_k$ be the matrix defined by
$$H_k = \begin{pmatrix} I_k & 0\\ 0 & \tilde H_k \end{pmatrix},$$
where $I_k$ is the identity matrix of order k, and $\tilde H_k$ is the Householder matrix (see Lemma 7.3.1) of order $n - k$ defined by
$$\tilde H_k = I_{n-k} - 2\,\frac{v_k v_k^t}{\|v_k\|^2} \quad \text{with} \quad v_k = a_k + \|a_k\| e_1, \tag{10.6}$$
where $e_1$ is the first vector of the canonical basis of $\mathbb{R}^{n-k}$. It satisfies $\tilde H_k a_k = -\|a_k\| e_1$. We observe that $H_k$ is orthogonal and $H_k^t = H_k$. The definition (10.6) of the Householder matrix is valid only if $a_k$ is not parallel to $e_1$; otherwise, the kth column of $A_k$ is already of the desired form, so we take $\tilde H_k = I_{n-k}$. We compute $A_{k+1} = H_k^t A_k H_k$:
$$A_{k+1} = \begin{pmatrix} T_k & (\tilde H_k E_k)^t\\ \tilde H_k E_k & \tilde H_k M_k \tilde H_k \end{pmatrix},$$
where
$$\tilde H_k E_k = \begin{pmatrix} 0 & \cdots & 0 & -\|a_k\|\\ 0 & \cdots & 0 & 0\\ \vdots & & \vdots & \vdots\\ 0 & \cdots & 0 & 0 \end{pmatrix}.$$
Accordingly, $A_{k+1}$ takes the desired form:
$$A_{k+1} = \begin{pmatrix} T_{k+1} & E_{k+1}^t\\ E_{k+1} & M_{k+1} \end{pmatrix},$$
where $T_{k+1}$ is a square tridiagonal matrix of size $k + 1$, $M_{k+1}$ is a square matrix of size $n - k - 1$, and $E_{k+1}$ is a rectangular matrix of size $(n - k - 1, k + 1)$ whose only nonzero column is the last one.
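The construction in this proof can be sketched in Python as follows. The code applies each reflection $H_k = I - 2ww^t/\|w\|^2$ on the left and on the right without forming $H_k$, where $w$ is $v_k$ padded with zeros; the function name and the test matrix are assumptions of this example. It follows (10.6) literally, with the sign $+\|a_k\|$, and makes no attempt to choose the sign so as to limit cancellation.

```python
import math

def householder_tridiagonalize(a):
    """Return T = Q^t A Q, tridiagonal, for a real symmetric matrix a."""
    n = len(a)
    t = [row[:] for row in a]                       # work on a copy
    for k in range(n - 2):                          # n - 2 reflections
        col = [t[i][k] for i in range(k + 1, n)]    # the vector a_k
        alpha = math.sqrt(sum(x * x for x in col))
        v = col[:]
        v[0] += alpha                               # v_k = a_k + ||a_k|| e_1
        vv = sum(x * x for x in v)
        if vv == 0.0:                               # column already reduced
            continue
        w = [0.0] * (k + 1) + v                     # v_k embedded in R^n
        # left product: t <- t - (2 / vv) w (w^t t)
        wt = [sum(w[i] * t[i][j] for i in range(n)) for j in range(n)]
        for i in range(n):
            for j in range(n):
                t[i][j] -= 2.0 * w[i] * wt[j] / vv
        # right product: t <- t - (2 / vv) (t w) w^t
        tw = [sum(t[i][j] * w[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            for j in range(n):
                t[i][j] -= 2.0 * tw[i] * w[j] / vv
    return t

A = [[4.0, 1.0, 2.0, 2.0],
     [1.0, 2.0, 0.0, 1.0],
     [2.0, 0.0, 3.0, 2.0],
     [2.0, 1.0, 2.0, 1.0]]
T = householder_tridiagonalize(A)
```

Here the first column below the diagonal is $a_1 = (1, 2, 2)^t$ with $\|a_1\| = 3$, so the entry $T_{2,1}$ equals $-3$, in agreement with $\tilde H_1 a_1 = -\|a_1\| e_1$.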


Now we proceed to the second step of the algorithm, that is, the Givens 
bisection method. 


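Lemma 10.5.1. Consider a real tridiagonal symmetric matrix
$$A = \begin{pmatrix} b_1 & c_1 & & 0\\ c_1 & \ddots & \ddots & \\ & \ddots & \ddots & c_{n-1}\\ 0 & & c_{n-1} & b_n \end{pmatrix}.$$
If there exists an index i such that $c_i = 0$, then
$$\det(A - \lambda I_n) = \det(A_i - \lambda I_i)\,\det(A_{n-i} - \lambda I_{n-i}),$$
with
$$A_i = \begin{pmatrix} b_1 & c_1 & & 0\\ c_1 & \ddots & \ddots & \\ & \ddots & \ddots & c_{i-1}\\ 0 & & c_{i-1} & b_i \end{pmatrix} \quad \text{and} \quad A_{n-i} = \begin{pmatrix} b_{i+1} & c_{i+1} & & 0\\ c_{i+1} & \ddots & \ddots & \\ & \ddots & \ddots & c_{n-1}\\ 0 & & c_{n-1} & b_n \end{pmatrix}.$$

The proof of Lemma 10.5.1 is easy and left to the reader. It allows us to restrict our attention in the sequel to tridiagonal matrices with $c_i \ne 0$ for all $i \in \{1, \ldots, n-1\}$.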


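Proposition 10.5.2. For $1 \le i \le n$, let $A_i$ be the matrix of size i defined by
$$A_i = \begin{pmatrix} b_1 & c_1 & & 0\\ c_1 & \ddots & \ddots & \\ & \ddots & \ddots & c_{i-1}\\ 0 & & c_{i-1} & b_i \end{pmatrix},$$
with $c_i \ne 0$, and let $p_i(\lambda) = \det(A_i - \lambda I_i)$ be its characteristic polynomial. The sequence $p_i$ satisfies the induction formula
$$p_i(\lambda) = (b_i - \lambda)\,p_{i-1}(\lambda) - c_{i-1}^2\, p_{i-2}(\lambda), \quad \forall i \ge 2,$$
with
$$p_0(\lambda) = 1 \quad \text{and} \quad p_1(\lambda) = b_1 - \lambda.$$
Moreover, for all $i \ge 1$, the polynomial $p_i$ has the following properties:

1. $\lim_{\lambda \to -\infty} p_i(\lambda) = +\infty$;
2. if $p_i(\lambda_0) = 0$, then $p_{i-1}(\lambda_0)\,p_{i+1}(\lambda_0) < 0$;
3. the polynomial $p_i$ has i real distinct roots that strictly separate the $(i + 1)$ roots of $p_{i+1}$.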


Remark 10.5.1. A consequence of Proposition 10.5.2 is that when all entries
$c_i$ are nonzero, the tridiagonal matrix $A$ has only simple eigenvalues (in other
words, they are all distinct).


Proof of Proposition 10.5.2. Expanding $\det(A_i - \lambda I_i)$ with respect to the
last row, we get the desired induction formula. The first property is obvious by
definition of the characteristic polynomial. To prove the second property, we
notice that if $p_i(\lambda_0) = 0$, then the induction formula implies that
$p_{i+1}(\lambda_0) = -c_i^2\, p_{i-1}(\lambda_0)$. Since $c_i \neq 0$, we deduce

$$
p_{i-1}(\lambda_0)\, p_{i+1}(\lambda_0) \le 0.
$$

This inequality is actually strict; otherwise, if either $p_{i-1}(\lambda_0) = 0$ or
$p_{i+1}(\lambda_0) = 0$, then the induction formula would imply that $p_k(\lambda_0) = 0$
for all $0 \le k \le i+1$, which is not possible because $p_0(\lambda_0) = 1$.

Concerning the third property, we first remark that $p_i(\lambda)$ has $i$ real roots,
denoted by $\lambda_1^i \le \cdots \le \lambda_i^i$, because $A_i$ is real symmetric. Let us show by
induction that these $i$ roots of $p_i$ are distinct and separated by those of $p_{i-1}$.
First of all, this property is satisfied for $i = 2$. Indeed,

$$
p_2(\lambda) = (b_2 - \lambda)(b_1 - \lambda) - c_1^2
$$

has two roots $(\lambda_1^2, \lambda_2^2)$ that strictly bound the only root $\lambda_1^1 = b_1$ of $p_1(\lambda)$,
i.e., $\lambda_1^2 < \lambda_1^1 < \lambda_2^2$. Assuming that $p_i(\lambda)$ has $i$ real distinct roots
separated by those of $p_{i-1}$, we now study the $i+1$ real roots of $p_{i+1}$. We define a
polynomial $q_i$ of degree $2i$ by

$$
q_i(\lambda) = p_{i-1}(\lambda)\, p_{i+1}(\lambda).
$$


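We already know $i-1$ roots of $q_i$ (those of $p_{i-1}$), and we also know that the
$i$ roots of $p_i$ are such that $q_i(\lambda_k^i) < 0$. In other words,

$$
q_i(\lambda_k^{i-1}) = 0 \ \text{ for } 1 \le k \le i-1, \qquad q_i(\lambda_k^i) < 0 \ \text{ for } 1 \le k \le i,
$$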


with (see Figure 10.1)

$$
\lambda_1^i < \lambda_1^{i-1} < \lambda_2^i < \cdots < \lambda_{i-1}^{i-1} < \lambda_i^i.
$$

Between $\lambda_k^i$ and $\lambda_{k+1}^i$, either $q_i$ vanishes at another point
$\gamma_k \neq \lambda_k^{i-1}$, in which case we have found another root of $q_i$, hence of
$p_{i+1}$, or $q_i$ vanishes only at $\lambda_k^{i-1}$, meaning it is at least a double root,
since its derivative $q_i'$ has to vanish at $\lambda_k^{i-1}$ too. On the other hand,
$\lambda_k^{i-1}$ is a simple root of $p_{i-1}$, so $\lambda_k^{i-1}$ is also a root of $p_{i+1}$.
Because of the induction relation, this would prove that $\lambda_k^{i-1}$ is a root for all
polynomials $p_j$ with $0 \le j \le i+1$, which is not possible because $p_0 = 1$ has no
roots. As a consequence, we have proved that between each pair $\lambda_k^i, \lambda_{k+1}^i$
there exists another root $\gamma_k \neq \lambda_k^{i-1}$ of the polynomial $q_i$, thus of
$p_{i+1}$. Overall, we have just found $(i-1)$ distinct roots of $p_{i+1}$ that bound
those of $p_i$. Moreover, $q_i(\lambda_1^i) < 0$ and $q_i(\lambda_i^i) < 0$, whereas

$$
\lim_{\lambda \to \pm\infty} q_i(\lambda) = +\infty.
$$

So we deduce the existence of two more distinct roots of $q_i$, hence of $p_{i+1}$ (for
a total number of $i+1$ roots), that intertwine those of $p_i$.


Fig. 10.1. Example of p; polynomials in Proposition 10.5.2. 


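Proposition 10.5.3. For all $\mu \in \mathbb{R}$, we define

$$
\operatorname{sgn} p_i(\mu) = \begin{cases} \text{sign of } p_i(\mu) & \text{if } p_i(\mu) \neq 0, \\ \text{sign of } p_{i-1}(\mu) & \text{if } p_i(\mu) = 0. \end{cases}
$$

Let $N(i, \mu)$ be the number of sign changes between consecutive elements of the
set $E(i, \mu) = \{+1, \operatorname{sgn} p_1(\mu), \operatorname{sgn} p_2(\mu), \ldots, \operatorname{sgn} p_i(\mu)\}$.
Then $N(i, \mu)$ is the number of roots of $p_i$ that are strictly less than $\mu$.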
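
As an illustration of this counting rule, here is a minimal Python sketch. The function name `sturm_count` and the storage convention (diagonal `b` of length $n$, off-diagonal `c` of length $n-1$, zero-based lists) are assumptions made for the example. The polynomials are evaluated with the induction formula of Proposition 10.5.2, with no rescaling, so overflow is possible for large matrices.

```python
def sturm_count(b, c, mu):
    """Return N(n, mu): the number of eigenvalues strictly below mu of the
    symmetric tridiagonal matrix with diagonal b and off-diagonal c."""
    count = 0
    prev_sign = 1                       # leading +1 of the set E(i, mu)
    p_prev, p_curr = 1.0, b[0] - mu     # p_0(mu) and p_1(mu)
    for i in range(len(b)):
        if i > 0:
            # p_{i+1} = (b_{i+1} - mu) p_i - c_i^2 p_{i-1} in 1-based notation
            p_prev, p_curr = p_curr, (b[i] - mu) * p_curr - c[i - 1] ** 2 * p_prev
        if p_curr > 0:
            sign = 1
        elif p_curr < 0:
            sign = -1
        else:
            # p_i(mu) = 0: use the sign of p_{i-1}(mu), which is nonzero
            sign = 1 if p_prev > 0 else -1
        if sign != prev_sign:
            count += 1
        prev_sign = sign
    return count
```

With `b = [2, 2, 2]` and `c = [-1, -1]` the eigenvalues are $2 - \sqrt 2$, $2$, and $2 + \sqrt 2$, so the count at $\mu = 2$ is 1: the eigenvalue equal to $\mu$ is not counted.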


Proof. First, we note that $\operatorname{sgn} p_i(\mu)$ is defined without ambiguity, since if
$p_i(\mu) = 0$, then $p_{i-1}(\mu) \neq 0$ because of the second point in Proposition 10.5.2.
We proceed by induction on $i$. For $i = 1$, we check the claim

$$
\mu \le b_1 \ \Rightarrow\ E(1, \mu) = \{+1, +1\} \ \Rightarrow\ N(1, \mu) = 0,
$$
and
$$
\mu > b_1 \ \Rightarrow\ E(1, \mu) = \{+1, -1\} \ \Rightarrow\ N(1, \mu) = 1.
$$

We assume the claim to be true up to the $i$th order. Let $(\lambda_\ell^{i+1})_{1 \le \ell \le i+1}$ be
the roots of $p_{i+1}$ and $(\lambda_\ell^i)_{1 \le \ell \le i}$ be those of $p_i$, sorted in increasing
order. By the induction assumption, we have

$$
\lambda_1^i < \cdots < \lambda_{N(i,\mu)}^i < \mu \le \lambda_{N(i,\mu)+1}^i < \cdots < \lambda_i^i.
$$

In addition,
$$
\lambda_{N(i,\mu)}^i < \lambda_{N(i,\mu)+1}^{i+1} < \lambda_{N(i,\mu)+1}^i,
$$

by virtue of the third point in Proposition 10.5.2. Therefore, there are three
possible cases.

• First case. If $\lambda_{N(i,\mu)}^i < \mu \le \lambda_{N(i,\mu)+1}^{i+1}$, then
  $\operatorname{sgn} p_{i+1}(\mu) = \operatorname{sgn} p_i(\mu)$. Thus $N(i+1, \mu) = N(i, \mu)$.

• Second case. If $\lambda_{N(i,\mu)+1}^{i+1} < \mu < \lambda_{N(i,\mu)+1}^i$, then
  $\operatorname{sgn} p_{i+1}(\mu) = -\operatorname{sgn} p_i(\mu)$. Thus $N(i+1, \mu) = N(i, \mu) + 1$.

• Third case. If $\mu = \lambda_{N(i,\mu)+1}^i$, then
  $\operatorname{sgn} p_i(\mu) = \operatorname{sgn} p_{i-1}(\mu) = -\operatorname{sgn} p_{i+1}(\mu)$,
  according to the second point in Proposition 10.5.2. Therefore
  $N(i+1, \mu) = N(i, \mu) + 1$.

In all cases, $N(i+1, \mu)$ is indeed the number of roots of $p_{i+1}$ that are strictly
smaller than $\mu$.


We now describe the Givens algorithm that enables us to numerically
compute some or all eigenvalues of the real symmetric matrix $A$. We denote
by $\lambda_1 \le \cdots \le \lambda_n$ the eigenvalues of $A$ arranged in increasing order.


Givens algorithm. In order to compute the $i$th eigenvalue $\lambda_i$ of $A$, we
consider an interval $[a_0, b_0]$ that we are sure $\lambda_i$ belongs to (for instance,
$-a_0 = b_0 = \|A\|_2$). Then we compute the number $N(n, \frac{a_0+b_0}{2})$ defined in
Proposition 10.5.3 (the values of the sequence $p_j(\frac{a_0+b_0}{2})$, for $1 \le j \le n$, are
computed by the induction formula of Proposition 10.5.2). If $N(n, \frac{a_0+b_0}{2}) \ge i$,
then we conclude that $\lambda_i$ belongs to the interval $[a_0, \frac{a_0+b_0}{2}[$. If, on the
contrary, $N(n, \frac{a_0+b_0}{2}) < i$, then $\lambda_i$ belongs to the other interval
$[\frac{a_0+b_0}{2}, b_0]$. In both cases, we have divided by two the initial interval that
contains $\lambda_i$. By dichotomy, that is, by repeating this procedure of dividing the
interval containing $\lambda_i$, we approximate the exact value $\lambda_i$ with the desired accuracy.
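
The dichotomy can be sketched as follows in Python. This is an illustrative sketch, not a reference implementation: the names `count_below` and `givens_bisection`, the tolerance `tol`, and the cap `max_steps` are assumptions. The starting interval uses the maximal absolute row sum of the tridiagonal matrix instead of $\|A\|_2$; this swap is deliberate, since the row-sum norm is trivial to compute and also bounds every eigenvalue in modulus.

```python
def count_below(b, c, mu):
    """N(n, mu) of Proposition 10.5.3 for diagonal b and off-diagonal c."""
    count, prev_sign = 0, 1
    p_prev, p = 1.0, b[0] - mu
    for i in range(len(b)):
        if i > 0:
            p_prev, p = p, (b[i] - mu) * p - c[i - 1] ** 2 * p_prev
        # when p_i(mu) = 0 the sign of p_{i-1}(mu) is reused
        sign = prev_sign if p == 0 else (1 if p > 0 else -1)
        if sign != prev_sign:
            count += 1
        prev_sign = sign
    return count

def givens_bisection(b, c, i, tol=1e-12, max_steps=200):
    """Approximate the i-th smallest eigenvalue (i = 1, ..., n)."""
    n = len(b)
    # every eigenvalue lies in [-r, r], r being the maximal absolute row sum
    r = max(abs(b[j]) + (abs(c[j - 1]) if j > 0 else 0.0)
            + (abs(c[j]) if j < n - 1 else 0.0) for j in range(n))
    lo, hi = -r, r
    for _ in range(max_steps):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if count_below(b, c, mid) >= i:   # lambda_i < mid
            hi = mid
        else:                             # lambda_i >= mid
            lo = mid
    return 0.5 * (lo + hi)
```

Each step costs $O(n)$ operations, so one eigenvalue is obtained to a fixed tolerance in a number of operations proportional to $n$.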


Remark 10.5.2. The Givens-Householder method allows us to compute one
(or several) eigenvalue(s) of any rank $i$ without having to compute all the
eigenvalues (as is the case for the Jacobi method) or all the eigenvalues
between the $i$th and the first or the last (as is the case for the power method
with deflation).


10.6 QR Method


The QR method is the most-used algorithm to compute all the eigenvalues of
a matrix. We restrict ourselves to the case of nonsingular real matrices whose
eigenvalues have distinct moduli

$$
0 < |\lambda_1| < \cdots < |\lambda_{n-1}| < |\lambda_n|. \qquad (10.7)
$$

The analysis of the general case is beyond the scope of this course (for more
details we refer the reader to [3], [7], [13], and [15]). A real matrix $A$
satisfying (10.7) is necessarily invertible, and diagonalizable, with distinct real
eigenvalues, but we do not assume that it is symmetric.


Definition 10.6.1. Let $A$ be a real matrix satisfying (10.7). The QR method
for computing its eigenvalues consists in building the sequence of matrices
$(A_k)_{k \ge 1}$ with $A_1 = A$ and

$$
A_{k+1} = R_k Q_k,
$$
where $Q_k R_k = A_k$ is the QR factorization of the nonsingular matrix $A_k$ (see
Section 6.4).
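
The iteration of Definition 10.6.1 can be sketched as follows. The names `qr_gram_schmidt`, `matmul`, and `qr_method`, as well as the fixed iteration count, are assumptions made for this illustration. The factorization uses modified Gram-Schmidt only to keep the example short; it is not the Householder factorization that one would prefer in practice.

```python
import math

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def qr_gram_schmidt(A):
    """QR factorization of a nonsingular square matrix (modified Gram-Schmidt)."""
    n = len(A)
    q_cols = []                                   # orthonormal columns of Q
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [float(A[i][j]) for i in range(n)]    # j-th column of A
        for i, q in enumerate(q_cols):
            R[i][j] = sum(q[t] * v[t] for t in range(n))
            v = [v[t] - R[i][j] * q[t] for t in range(n)]
        R[j][j] = math.sqrt(sum(x * x for x in v))   # positive diagonal entry
        q_cols.append([x / R[j][j] for x in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R

def qr_method(A, iterations=200):
    """Return A_{iterations+1}, starting from A_1 = A."""
    Ak = [row[:] for row in A]
    for _ in range(iterations):
        Q, R = qr_gram_schmidt(Ak)    # A_k = Q_k R_k
        Ak = matmul(R, Q)             # A_{k+1} = R_k Q_k
    return Ak
```

For the nonsymmetric matrix with rows $(4, 1)$ and $(2, 3)$, whose eigenvalues are 5 and 2, the first step already gives the matrix with rows $(5, 0)$ and $(1, 2)$; the entry below the diagonal then decays and the diagonal tends to $(5, 2)$.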


Recall that the matrices $Q_k$ are orthogonal and the matrices $R_k$ are upper
triangular. We first prove a simple technical lemma.


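Lemma 10.6.1.
1. The matrices $A_k$ are all similar,

$$
A_{k+1} = Q_k^* A_k Q_k, \qquad (10.8)
$$
$$
A_{k+1} = [Q^{(k)}]^* A\, Q^{(k)}, \qquad (10.9)
$$
with $Q^{(k)} = Q_1 \cdots Q_k$.
2. For all $k \ge 1$,
$$
A\, Q^{(k)} = Q^{(k+1)} R_{k+1}. \qquad (10.10)
$$

3. The QR factorization of $A^k$, the $k$th power of $A$ (not to be confused with
$A_k$), is

$$
A^k = Q^{(k)} R^{(k)} \qquad (10.11)
$$
with $R^{(k)} = R_k \cdots R_1$.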
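Proof.
1. By definition, $A_{k+1} = R_k Q_k = Q_k^* (Q_k R_k) Q_k = Q_k^* (A_k) Q_k$, and by
induction,
$$
A_{k+1} = Q_k^* (A_k) Q_k = Q_k^* Q_{k-1}^* (A_{k-1}) Q_{k-1} Q_k = \cdots = [Q^{(k)}]^* A\, Q^{(k)}.
$$

2. We compute

$$
\begin{aligned}
Q^{(k)} R_k &= Q_1 \cdots Q_{k-2} Q_{k-1} (Q_k R_k) = Q_1 \cdots Q_{k-2} Q_{k-1} (A_k) \\
&= Q_1 \cdots Q_{k-2} Q_{k-1} (R_{k-1} Q_{k-1}) = Q_1 \cdots Q_{k-2} (A_{k-1}) Q_{k-1} \\
&= Q_1 \cdots Q_{k-2} (R_{k-2} Q_{k-2}) Q_{k-1} = Q_1 \cdots (Q_{k-2} R_{k-2}) Q_{k-2} Q_{k-1} \\
&= Q_1 \cdots (A_{k-2}) Q_{k-2} Q_{k-1} = \cdots = A\, Q^{(k-1)}.
\end{aligned}
$$

3. Using the previous identity, we get
$$
Q^{(k)} R^{(k)} = (Q^{(k)} R_k) R^{(k-1)} = A\, Q^{(k-1)} R^{(k-1)},
$$

and by induction, (10.11).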


Remark 10.6.1. Remarkably, the algorithm of Definition 10.6.1 features, as a
special case, the power method and the inverse power method. Let
$\alpha_1^{(k)} = (R_k)_{1,1}$, $\alpha_n^{(k)} = (R_k)_{n,n}$, and let $u_1^{(k)}$ (resp. $u_n^{(k)}$) denote the
first (resp. last) column of $Q^{(k)}$, i.e.,

$$
Q^{(k)} = \left[ u_1^{(k)} \,|\, \cdots \,|\, u_n^{(k)} \right].
$$

Comparing the first column on both sides of identity (10.10), we get
$$
A u_1^{(k)} = \alpha_1^{(k+1)} u_1^{(k+1)}. \qquad (10.12)
$$

Since $\|u_1^{(k+1)}\|_2 = 1$, we have $\alpha_1^{(k+1)} = \|A u_1^{(k)}\|_2$, and formula (10.12) is
just the definition of the power method analyzed in Section 10.3. Under the
hypothesis of Theorem 10.3.1 we know that the sequence $\alpha_1^{(k)}$ converges to $|\lambda_n|$
and $u_1^{(k)}$ converges to an eigenvector associated to $\lambda_n$.

On the other hand, (10.10) reads $A = Q^{(k+1)} R_{k+1} [Q^{(k)}]^t$, which implies
by inversion and transposition (recall that by assumption, $A$ and $R_k$ are
nonsingular matrices)

$$
A^{-t} Q^{(k)} = Q^{(k+1)} R_{k+1}^{-t}.
$$
Comparing the last columns on both sides of this identity, we get

$$
A^{-t} u_n^{(k)} = \frac{1}{\alpha_n^{(k+1)}}\, u_n^{(k+1)}. \qquad (10.13)
$$

Since $\|u_n^{(k+1)}\|_2 = 1$, we deduce $\alpha_n^{(k+1)} = \|A^{-t} u_n^{(k)}\|_2^{-1}$, and formula (10.13)
is just the inverse power method. Under the hypothesis of Theorem 10.3.2
we know that the sequence $\alpha_n^{(k)}$ converges to $|\lambda_1|$ and $u_n^{(k)}$ converges to an
eigenvector associated to $\lambda_1$.


The convergence of the QR algorithm is proved in the following theorem 
with some additional assumptions on top of (10.7). 


Theorem 10.6.1. Let $A$ be a real matrix satisfying (10.7). Assume further
that $P^{-1}$ admits an LU factorization, where $P$ is the matrix of eigenvectors
of $A$, i.e., $A = P \operatorname{diag}(\lambda_n, \ldots, \lambda_1) P^{-1}$. Then the sequence $(A_k)_{k \ge 1}$, generated
by the QR method, converges to an upper triangular matrix whose diagonal
entries are the eigenvalues of $A$.

Proof. By assumption (10.7), the matrix $A$ is diagonalizable, i.e.,
$A = P D P^{-1}$ with $D = \operatorname{diag}(\lambda_n, \ldots, \lambda_1)$. Using the LU factorization of $P^{-1}$
and the QR factorization of $P$ we obtain

$$
A^k = P (D^k) P^{-1} = (QR)(D^k) LU = (QR)(D^k L D^{-k})(D^k U).
$$


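The matrix $D^k L D^{-k}$ is lower triangular with entries

$$
(D^k L D^{-k})_{i,j} = \begin{cases} 0 & \text{for } i < j, \\ 1 & \text{for } i = j, \\ \left( \dfrac{\lambda_{n-i+1}}{\lambda_{n-j+1}} \right)^k L_{i,j} & \text{for } i > j. \end{cases}
$$

Hypothesis (10.7) implies that $\lim_{k \to +\infty} (D^k L D^{-k})_{i,j} = 0$ for $i > j$. Hence
$D^k L D^{-k}$ tends to the identity matrix $I_n$ as $k$ goes to infinity, and we write

$$
D^k L D^{-k} = I_n + E_k, \quad\text{with}\quad \lim_{k \to +\infty} E_k = 0_n,
$$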


as well as
$$
A^k = (QR)(I_n + E_k)(D^k U) = Q (I_n + R E_k R^{-1}) R D^k U.
$$

For $k$ large enough, the matrix $I_n + R E_k R^{-1}$ is nonsingular and admits a
QR factorization: $I_n + R E_k R^{-1} = \tilde Q_k \tilde R_k$. Since $\tilde R_k R D^k U$ is upper
triangular as a product of upper triangular matrices (see Lemma 2.2.5), it yields a QR
factorization of $A^k$:

$$
A^k = (Q \tilde Q_k)(\tilde R_k R D^k U).
$$


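From (10.11) we already know another QR factorization of $A^k$. Thus, by
uniqueness of the QR factorization of $A^k$ (see Remark 6.4.2) there exists a
diagonal matrix $\tilde D_k$ such that

$$
Q^{(k)} = Q \tilde Q_k \tilde D_k \quad\text{with}\quad |(\tilde D_k)_{i,i}| = 1.
$$
Plugging this into the expression for $A_{k+1}$ given in (10.9), we obtain
$$
A_{k+1} = [Q \tilde Q_k \tilde D_k]^* A\, [Q \tilde Q_k \tilde D_k] = \tilde D_k^* [\tilde Q_k^* Q^* A Q \tilde Q_k] \tilde D_k. \qquad (10.14)
$$

The entries of $A_{k+1}$ are

$$
(A_{k+1})_{i,j} = (\tilde D_k^*)_{i,i}\, [\tilde Q_k^* Q^* A Q \tilde Q_k]_{i,j}\, (\tilde D_k)_{j,j} = \pm [\tilde Q_k^* Q^* A Q \tilde Q_k]_{i,j}. \qquad (10.15)
$$
In particular, the diagonal entries of $A_{k+1}$ are
$$
(A_{k+1})_{i,i} = (\tilde D_k^*)_{i,i}\, [\tilde Q_k^* Q^* A Q \tilde Q_k]_{i,i}\, (\tilde D_k)_{i,i} = [\tilde Q_k^* Q^* A Q \tilde Q_k]_{i,i}.
$$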
Now we make two observations.

• Firstly, $Q^* A Q = Q^* (P D P^{-1}) Q = Q^* (Q R D (QR)^{-1}) Q = R D R^{-1}$.
• Secondly, the sequence $(\tilde Q_k)_k$ is bounded ($\|\tilde Q_k\|_2 = 1$); it converges to
  a matrix $\tilde Q$ (consider first a subsequence, then the entire sequence). As
  a consequence, the upper triangular matrices $\tilde R_k = \tilde Q_k^* (I_n + R E_k R^{-1})$
  converge to the unitary matrix $\tilde Q^*$, which is also triangular (as a limit of
  triangular matrices), hence it is diagonal (see the proof of Theorem 2.5.1),
  and its diagonal entries are $\pm 1$.


Using these two remarks, we can pass to the limit in (10.15):

$$
\lim_{k \to +\infty} (A_{k+1})_{i,j} = \pm (R D R^{-1})_{i,j},
$$

which is equal to zero for $i > j$, and

$$
\lim_{k \to +\infty} (A_{k+1})_{i,i} = (R D R^{-1})_{i,i} = D_{i,i} = \lambda_{n+1-i}
$$

because of Lemma 2.2.5 on the product and inverse of upper triangular
matrices ($R$ is indeed upper triangular). In other words, the limit of $A_k$ is upper
triangular with the eigenvalues of $A$ on its diagonal.


Remark 10.6.2. For a symmetric matrix $A$, the matrices $A_k$ are symmetric too,
by virtue of (10.9). From Theorem 10.6.1 we thus deduce that the limit of $A_k$
is a diagonal matrix $D$. Since the sequence $(Q^{(k)})_k$ is bounded ($\|Q^{(k)}\|_2 = 1$),
up to a subsequence it converges to a unitary matrix $Q^{(\infty)}$. Passing to the
limit in (10.9) yields $D = [Q^{(\infty)}]^* A\, Q^{(\infty)}$, which implies that $Q^{(\infty)}$ is a matrix
of eigenvectors of $A$.


Let us now study some practical aspects of the QR method. The computation
of $A_{k+1}$ from $A_k$ requires the computation of a QR factorization and
a matrix multiplication. The QR factorization should be computed by the
Householder algorithm rather than by the Gram-Schmidt orthonormalization
process (see Remark 7.3.3). A priori, the QR factorization of a matrix of order
$n$ requires on the order of $O(n^3)$ operations. Such a complexity can be drastically
reduced by first reducing the original matrix $A$ to its upper Hessenberg
form. An upper Hessenberg matrix is an "almost" upper triangular matrix as
explained in the following definition.


Definition 10.6.2. An $n \times n$ matrix $T$ is called an upper Hessenberg matrix
if $T_{i,j} = 0$ for all integers $(i, j)$ such that $i > j + 1$.


We admit the two following results (see [3], [7], [13], and [15] if necessary). 
The first one explains how to compute the upper Hessenberg form of a matrix 
(it is very similar to Proposition 10.5.1). 


Proposition 10.6.1. For any $n \times n$ matrix $A$, there exists a unitary matrix
$P$, the product of $n - 2$ Householder matrices $H_1, \ldots, H_{n-2}$, such that the
matrix $P^* A P$ is an upper Hessenberg matrix.

Note that the Hessenberg transformation $A \to P^* A P$ preserves the spectrum
and the symmetry: if $A$ is symmetric, so is $P^* A P$. Hence the latter matrix is
tridiagonal (in which case Proposition 10.6.1 reduces to Proposition 10.5.1)
and the Givens algorithm is an efficient way to compute its spectrum. The
cost of the Hessenberg transformation is $O(n^3)$, but it is done only once before
starting the iterations of the QR method.

The second admitted result states that the structure of Hessenberg matri- 
ces is preserved during the iterations of the QR method. 


Proposition 10.6.2. If $A$ is an upper Hessenberg matrix, then the matrices
$(A_k)_{k \ge 1}$, defined by the QR method, are upper Hessenberg matrices too.


The main practical interest of the upper Hessenberg form is that the QR
factorization now requires only on the order of $O(n^2)$ operations, instead of
$O(n^3)$ for a full matrix.

In order to implement the QR method, we have to define a termination
criterion. A simple one consists in checking that the entries $(A_k)_{i,i-1}$ are very
small (recall that the sequence $A_k$ is upper Hessenberg). We proceed as follows:
if $(A_k)_{n,n-1}$ is small, then $(A_k)_{n,n}$ is considered as a good approximation of
an eigenvalue of $A$ (more precisely, of the smallest one $\lambda_1$ according to Remark
10.6.1) and the algorithm continues with the $(n-1) \times (n-1)$ matrix obtained
from $A_k$ by removing the last row and column $n$. This is the so-called deflation
algorithm. Actually, it can be proved that

$$
(A_k)_{n,n-1} = O\left( \left| \frac{\lambda_1}{\lambda_2} \right|^k \right), \qquad (10.16)
$$

which defines the speed of convergence of the QR method.

According to formula (10.16) for the speed of convergence, it is possible
(and highly desirable) to speed up the QR algorithm by applying it to the
"shifted" matrix $A - \sigma I_n$ instead of $A$. The spectrum of $A$ is simply recovered
by adding $\sigma$ to the eigenvalues of $A - \sigma I_n$. Of course $\sigma$ is chosen in such a
way that

$$
\frac{|\lambda_1 - \sigma|}{|\lambda_2 - \sigma|} \ll \frac{|\lambda_1|}{|\lambda_2|},
$$

i.e., $\sigma$ is a good approximation of the simple real eigenvalue $\lambda_1$. More precisely,
the QR algorithm is modified as follows:

1. compute the QR factorization of the matrix $A_k - \sigma_k I_n$:
   $A_k - \sigma_k I_n = Q_k R_k$;

2. define $A_{k+1} = R_k Q_k + \sigma_k I_n = Q_k^* A_k Q_k$.

Here the value of the shift is updated at each iteration $k$. A simple and efficient
choice is

$$
\sigma_k = (A_k)_{n,n}.
$$

Data: matrix A and integer N (maximal number of iterations)
Output: v a vector containing the eigenvalues of A
Initialization:
    ε = , N = , define the error tolerance and total number of iterations
    m = n, k = 1
    a = hess(A). Compute the Hessenberg reduced form of A
Iterations
While k ≤ N and m > 1
    If |a(m,m-1)| < ε
        v(m) = a(m,m)
        a(:,m) = [] delete column m of a
        a(m,:) = [] delete row m of a
        m = m - 1
    End
    compute (Q, R) the QR factorization of a
    a = RQ
    k = k + 1
End
v(1) = a(1,1)

Algorithm 10.3: QR method.


The case of real matrices with complex eigenvalues is more complicated (see, 
e.g., [13] and [15]). Note that these matrices do not fulfill the requirement 
(10.7) on the separation of the spectrum, since their complex eigenvalues 
come in pairs with equal modulus. 


10.7 Lanczos Method 


The Lanczos method computes the eigenvalues of a real symmetric matrix 
by using the notion of Krylov space, already introduced for the conjugate 
gradient method. 

In the sequel, we denote by $A$ a real symmetric matrix of order $n$,
$r_0 \in \mathbb{R}^n$ some given nonzero vector, and $K_k$ the Krylov space spanned by
$\{r_0, A r_0, \ldots, A^k r_0\}$. Recall that there exists an integer $k_0 \le n - 1$, called the
Krylov critical dimension, which is characterized by $\dim K_k = k + 1$ if $k \le k_0$,
while $K_k = K_{k_0}$ if $k > k_0$.

The Lanczos algorithm builds a sequence of vectors $(v_j)_{1 \le j \le k_0+1}$ by the
following induction formula:

$$
v_0 = 0, \qquad v_1 = \frac{r_0}{\|r_0\|}, \qquad (10.17)
$$

and for $2 \le j \le k_0 + 1$,

$$
v_j = \frac{\hat v_j}{\|\hat v_j\|} \quad\text{with}\quad \hat v_j = A v_{j-1} - \langle A v_{j-1}, v_{j-1} \rangle v_{j-1} - \|\hat v_{j-1}\|\, v_{j-2}. \qquad (10.18)
$$
J 


We introduce some notation: for all integers $k \le k_0 + 1$, we define an $n \times k$
matrix $V_k$ whose columns are the vectors $(v_1, \ldots, v_k)$, as well as a tridiagonal
symmetric matrix $T_k$ of order $k$ whose entries are

$$
(T_k)_{i,i} = \langle A v_i, v_i \rangle, \quad (T_k)_{i,i+1} = (T_k)_{i+1,i} = \|\hat v_{i+1}\|, \quad (T_k)_{i,j} = 0 \ \text{ if } |i - j| \ge 2.
$$
The Lanczos induction (10.17)-(10.18) satisfies remarkable properties.


Lemma 10.7.1. The sequence $(v_j)_{1 \le j \le k_0+1}$ is well defined by (10.18), since
$\|\hat v_j\| \neq 0$ for all $1 \le j \le k_0 + 1$, whereas $\hat v_{k_0+2} = 0$. For $0 \le k \le k_0$,
the family $(v_1, \ldots, v_{k+1})$ coincides with the orthonormal basis of $K_k$ built by
application of the Gram-Schmidt procedure to the family $(r_0, A r_0, \ldots, A^k r_0)$.
Furthermore, for $1 \le k \le k_0 + 1$, we have

$$
A V_k = V_k T_k + \hat v_{k+1} e_k^t, \qquad (10.19)
$$
where $e_k$ is the $k$th vector of the canonical basis of $\mathbb{R}^k$,
$$
V_k^t A V_k = T_k, \quad\text{and}\quad V_k^t V_k = I_k, \qquad (10.20)
$$
where $I_k$ is the identity matrix of order $k$.
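
The induction (10.17)-(10.18) translates almost line by line into code. The sketch below is illustrative only: the function name `lanczos`, the returned tuple, and the breakdown threshold `1e-12` are assumptions, and no reorthogonalization is attempted. It returns the vectors $v_1, \ldots, v_k$, the diagonal and off-diagonal entries of $T_k$, and the last unnormalized vector, which is $\hat v_{k+1}$ when no breakdown occurs.

```python
import math

def lanczos(A, r0, k):
    """Run at most k >= 1 Lanczos steps on the symmetric matrix A.
    Returns (V, alpha, beta, residual) where V lists the vectors v_j,
    alpha[j] is the j-th diagonal entry of T, beta[j] the j-th off-diagonal
    entry, and residual is the last unnormalized vector computed."""
    n = len(A)
    dot = lambda x, y: sum(s * t for s, t in zip(x, y))
    norm0 = math.sqrt(dot(r0, r0))
    v_prev = [0.0] * n                    # v_0 = 0
    v = [x / norm0 for x in r0]           # v_1 = r_0 / ||r_0||
    beta_prev = 0.0
    V, alpha, beta = [], [], []
    for j in range(k):
        V.append(v)
        w = [dot(row, v) for row in A]    # A v_j
        a = dot(w, v)                     # <A v_j, v_j>
        alpha.append(a)
        # unnormalized next vector, following (10.18)
        w = [w[i] - a * v[i] - beta_prev * v_prev[i] for i in range(n)]
        b = math.sqrt(dot(w, w))
        if j == k - 1 or b < 1e-12:       # requested size reached, or breakdown
            return V, alpha, beta, w
        beta.append(b)
        v_prev, v, beta_prev = v, [x / b for x in w], b
```

With the $4 \times 4$ matrix having 2 on the diagonal and $-1$ on the two adjacent diagonals, and $r_0 = e_1$, three steps give $T_3$ with diagonal $(2, 2, 2)$ and off-diagonal $(1, 1)$, and the residual $(0, 0, 0, -1)^t$, in agreement with (10.19).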


Remark 10.7.1. Beware that the square matrices $A$ and $T_k$ are of different
sizes, and the matrix $V_k$ is rectangular, so it is not unitary (except if $k = n$).


Proof. Let us forget for the moment the definition (10.17)-(10.18) of the
sequence $(v_j)$ and substitute it with the new definition (which we will show
to be equivalent to (10.18)) $v_0 = 0$, $v_1 = \frac{r_0}{\|r_0\|}$, and for $j \ge 2$,

$$
v_j = \frac{\hat v_j}{\|\hat v_j\|}, \quad\text{where}\quad \hat v_j = A v_{j-1} - \sum_{i=1}^{j-1} \langle A v_{j-1}, v_i \rangle v_i. \qquad (10.21)
$$

Of course, (10.21) is meaningless unless $\|\hat v_j\| \neq 0$. If $\|\hat v_j\| = 0$, we shall say
that the algorithm stops at index $j$. By definition, $v_j$ is orthogonal to all $v_i$ for
$1 \le i \le j - 1$. By induction we easily check that $v_j \in K_{j-1}$. The sequence of
Krylov spaces is strictly increasing for $j \le k_0 + 1$, that is, $K_{j-2} \subset K_{j-1}$ and
$\dim K_{j-2} = j - 1 < \dim K_{j-1} = j$. Therefore, as long as the algorithm has not
stopped (i.e., $\|\hat v_j\| \neq 0$), the vectors $(v_1, \ldots, v_j)$ form an orthonormal basis of
$K_{j-1}$. Consequently, $v_j$, being orthogonal to $(v_1, \ldots, v_{j-1})$, is also orthogonal
to $K_{j-2}$. Hence, according to the uniqueness result of Lemma 9.5.1, we have
just proved that the family $(v_1, \ldots, v_j)$, defined by (10.21), coincides with
the orthonormal basis of $K_{j-1}$ built by the Gram-Schmidt procedure applied
to the family $(r_0, A r_0, \ldots, A^{j-1} r_0)$. In particular, this proves that the only
possibility for the algorithm to stop is that the family $(r_0, A r_0, \ldots, A^{j-1} r_0)$ is
linearly dependent, i.e., that $j - 1$ is larger than the Krylov critical dimension
$k_0$. Hence, $\|\hat v_j\| \neq 0$ as long as $j \le k_0 + 1$, and $\hat v_{k_0+2} = 0$.

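Now, let us show that definitions (10.18) and (10.21) of the sequence $(v_j)$
are identical. Since $A$ is symmetric, we have

$$
\langle A v_{j-1}, v_i \rangle = \langle v_{j-1}, A v_i \rangle = \langle v_{j-1}, \hat v_{i+1} \rangle + \sum_{k=1}^{i} \langle A v_i, v_k \rangle \langle v_{j-1}, v_k \rangle.
$$

Thanks to the orthogonality properties of $(v_k)$, we deduce that $\langle A v_{j-1}, v_i \rangle = 0$
if $1 \le i \le j - 3$ and $\langle A v_{j-1}, v_{j-2} \rangle = \|\hat v_{j-1}\|$. Therefore, definitions (10.18)
and (10.21) coincide.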

Finally, the matrix equality (10.19), taken column by column, is nothing
else than (10.18) rewritten, for $2 \le j \le k$,

$$
A v_{j-1} = \|\hat v_j\|\, v_j + \langle A v_{j-1}, v_{j-1} \rangle v_{j-1} + \|\hat v_{j-1}\|\, v_{j-2},
$$
and
$$
A v_k = \hat v_{k+1} + \langle A v_k, v_k \rangle v_k + \|\hat v_k\|\, v_{k-1}.
$$

The property that $V_k^t V_k = I_k$ is due to the orthonormality properties of
$(v_1, \ldots, v_k)$, whereas the relation $V_k^t A V_k = T_k$ is obtained by multiplying
(10.19) on the left by $V_k^t$ and taking into account that $V_k^t \hat v_{k+1} = 0$.


Remark 10.7.2. The computational cost of the Lanczos algorithm (10.18)
is obviously much less than that of the Gram-Schmidt algorithm applied
to the family $(r_0, A r_0, \ldots, A^k r_0)$, which yields the same result. The point is
that the sum in (10.18) contains only two terms, while the corresponding sum
in the Gram-Schmidt algorithm contains all previous terms (see Theorem
2.1.1). When the Krylov critical dimension is maximal, i.e., $k_0 = n - 1$,
relation (10.19) or (10.20) for $k = k_0 + 1$ shows that the matrices $A$ and $T_{k_0+1}$
are similar (since $V_{k_0+1}$ is a square nonsingular matrix if $k_0 = n - 1$). In other
words, the Lanczos algorithm can be seen as a tridiagonal reduction method,
like the Householder algorithm of Section 10.5. Nevertheless, the Lanczos
algorithm is not used in practice as a tridiagonalization method. In effect, for
$n$ large, the rounding errors partially destroy the orthogonality of the last
vectors $v_j$ with respect to the first ones (a shortcoming already observed for
the Gram-Schmidt algorithm).


We now compare the eigenvalues and eigenvectors of $A$ and $T_{k_0+1}$. Let us
recall right away that these matrices are usually not of the same size (except
when $k_0 + 1 = n$). Consider $\lambda_1 < \lambda_2 < \cdots < \lambda_m$, the distinct eigenvalues of
$A$ (with $1 \le m \le n$), and $P_1, \ldots, P_m$, the orthogonal projection matrices on
the corresponding eigensubspaces of $A$. We recall that

$$
A = \sum_{i=1}^{m} \lambda_i P_i, \qquad I = \sum_{i=1}^{m} P_i, \qquad\text{and}\qquad P_i P_j = 0 \ \text{ if } i \neq j. \qquad (10.22)
$$

Lemma 10.7.2. The eigenvalues of $T_{k_0+1}$ are simple and are eigenvalues of
$A$ too. Conversely, if $r_0$ satisfies $P_i r_0 \neq 0$ for all $1 \le i \le m$, then $k_0 + 1 = m$,
and all eigenvalues of $A$ are also eigenvalues of $T_{k_0+1}$.


Remark 10.7.3. When $P_i r_0 \neq 0$ for all $i$, $A$ and $T_{k_0+1}$ have exactly the same
eigenvalues, albeit with possibly different multiplicities. The condition
imposed on $r_0$ for the converse part of this lemma is indeed necessary. For example,
if $r_0$ is an eigenvector of $A$, then $k_0 = 0$ and the matrix $T_{k_0+1}$ has a unique
eigenvalue, the one associated with $r_0$.


Proof of Lemma 10.7.2. Let $\lambda$ and $y \in \mathbb{R}^{k_0+1}$ be an eigenvalue and eigenvector
of $T_{k_0+1}$, i.e., $T_{k_0+1} y = \lambda y$. Since $\hat v_{k_0+2} = 0$, (10.19) becomes for $k = k_0 + 1$

$$
A V_{k_0+1} = V_{k_0+1} T_{k_0+1}.
$$

Multiplication by the vector $y$ yields $A (V_{k_0+1} y) = \lambda (V_{k_0+1} y)$. The vector
$V_{k_0+1} y$ is nonzero, since $y \neq 0$ and the columns of $V_{k_0+1}$ are linearly
independent. Consequently, $V_{k_0+1} y$ is an eigenvector of $A$ associated with the
eigenvalue $\lambda$, which is therefore an eigenvalue of $A$ too.

Conversely, we introduce a vector subspace $E_m$ of $\mathbb{R}^n$, spanned by the
vectors $(P_1 r_0, \ldots, P_m r_0)$, which are assumed to be nonzero, $P_i r_0 \neq 0$, for all
$1 \le i \le m$. These vectors are linearly independent, since the projections $P_i$
are mutually orthogonal. Accordingly, the dimension of $E_m$ is exactly $m$. Let
us show under this assumption that $m = k_0 + 1$. By (10.22) we have

$$
A^k r_0 = \sum_{i=1}^{m} \lambda_i^k P_i r_0,
$$

i.e., $A^k r_0 \in E_m$. Hence the Krylov spaces satisfy $K_k \subset E_m$ for all $k \ge 0$. In
particular, this implies that $\dim K_{k_0} = k_0 + 1 \le m$. On the other hand, in
the basis $(P_1 r_0, \ldots, P_m r_0)$ of $E_m$, the coordinates of $A^k r_0$ are $(\lambda_1^k, \ldots, \lambda_m^k)$.
Writing the coordinates of the family $(r_0, A r_0, \ldots, A^{m-1} r_0)$ in the basis
$(P_1 r_0, \ldots, P_m r_0)$ yields a matrix representation $M$:


$$
M = \begin{pmatrix} 1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{m-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & \lambda_m & \lambda_m^2 & \cdots & \lambda_m^{m-1} \end{pmatrix},
$$
which is just a Vandermonde matrix of order $m$. It is nonsingular, since
$$
\det(M) = \prod_{i=1}^{m-1} \prod_{j > i} (\lambda_j - \lambda_i)
$$

and the eigenvalues $\lambda_i$ are distinct. As a result, the family $(r_0, A r_0, \ldots, A^{m-1} r_0)$
is linearly independent, which implies that $\dim K_{m-1} = m$, and accordingly
$m - 1 \le k_0$. We thus conclude that $m = k_0 + 1$ and $E_m = K_{k_0}$. On the other
hand, multiplying (10.22) by $P_i r_0$ yields

$$
A (P_i r_0) = \lambda_i (P_i r_0).
$$

Since $P_i r_0$ is nonzero, it is indeed an eigenvector of $A$ associated with the
eigenvalue $\lambda_i$. Because $E_m = K_{k_0}$, and the columns of $V_{k_0+1}$ are a basis of
$K_{k_0}$, we deduce the existence of a nonzero vector $y_i \in \mathbb{R}^m$ such that

$$
P_i r_0 = V_{k_0+1} y_i.
$$
We multiply the first equality of (10.20) by $y_i$ to obtain

$$
\begin{aligned}
T_{k_0+1} y_i &= V_{k_0+1}^t A V_{k_0+1} y_i = V_{k_0+1}^t A P_i r_0 \\
&= \lambda_i V_{k_0+1}^t P_i r_0 = \lambda_i V_{k_0+1}^t V_{k_0+1} y_i = \lambda_i y_i,
\end{aligned}
$$

which proves that $y_i$ is an eigenvector of $T_{k_0+1}$ for the eigenvalue $\lambda_i$.


In view of Lemma 10.7.2 one may believe that the Lanczos algorithm has
to be carried up to the maximal iteration number $k_0 + 1$, before computing
the eigenvalues of $T_{k_0+1}$ in order to deduce the eigenvalues of $A$. Such a
practice makes the Lanczos method comparable to the Givens-Householder
algorithm, since usually $k_0$ is of order $n$. Moreover, if $n$ or $k_0$ is large, the
Lanczos algorithm will be numerically unstable because of orthogonality losses
for the vectors $v_j$ (caused by unavoidable rounding errors; see Remark 10.7.2).
However, as for the conjugate gradient method, it is not necessary to perform
as many iterations as $k_0 + 1$ to obtain good approximate results. Indeed, in
numerical practice one usually stops the algorithm after $k$ iterations (with
$k$ much smaller than $k_0$ or $n$) and computes the eigenvalues of $T_k$, which
turn out to be good approximations of those of $A$, according to the following
lemma.


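Lemma 10.7.3. Fix the iteration number $1 \le k \le k_0 + 1$. For any eigenvalue
$\lambda$ of $T_k$, there exists an eigenvalue $\lambda_i$ of $A$ satisfying

$$
|\lambda - \lambda_i| \le \|\hat v_{k+1}\|. \qquad (10.23)
$$

Furthermore, if $y \in \mathbb{R}^k$ is a nonzero eigenvector of $T_k$ associated to the
eigenvalue $\lambda$, then there exists an eigenvalue $\lambda_i$ of $A$ such that

$$
|\lambda - \lambda_i| \le \frac{|\langle e_k, y \rangle|}{\|y\|}\, \|\hat v_{k+1}\|, \qquad (10.24)
$$

where $e_k$ is the $k$th vector of the canonical basis of $\mathbb{R}^k$.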


Remark 10.7.4. The first conclusion (10.23) of Lemma 10.7.3 states that the
eigenvalues of $T_k$ are good approximations of some eigenvalues of $A$, provided
that $\|\hat v_{k+1}\|$ is small. The second conclusion (10.24) is the most valuable one:
if the last entry of $y$ is small, then $\lambda$ is a good approximation of an eigenvalue
of $A$ even if $\|\hat v_{k+1}\|$ is not small. In practice, we test the magnitude of the last
entry of an eigenvector of $T_k$ to know whether the corresponding eigenvalue
is a good approximation of an eigenvalue of $A$. For details of the numerical
implementation of this method we refer to more advanced monographs, e.g.,
[2]. The Lanczos method is very efficient for large $n$, since it gives good results
for a total number of iterations $k$ much smaller than $n$, and it is at the root
of many fruitful generalizations.


Proof of Lemma 10.7.3. Consider a nonzero eigenvector $y \in \mathbb{R}^k$ such that
$T_k y = \lambda y$. Multiplying (10.19) by $y$, we get

$$
A V_k y = V_k T_k y + \langle e_k, y \rangle\, \hat v_{k+1},
$$
from which we deduce
$$
A (V_k y) - \lambda (V_k y) = \langle e_k, y \rangle\, \hat v_{k+1}. \qquad (10.25)
$$

Then, we expand $V_k y$ in the eigenvectors basis of $A$, $V_k y = \sum_{i=1}^{m} P_i (V_k y)$.
Taking the scalar product of (10.25) with $V_k y$, and using relations (10.22)
leads to

$$
\sum_{i=1}^{m} (\lambda_i - \lambda)\, |P_i (V_k y)|^2 = \langle e_k, y \rangle \langle \hat v_{k+1}, V_k y \rangle. \qquad (10.26)
$$
Applying the Cauchy-Schwarz inequality to the right-hand side of (10.26)
yields
$$
\min_{1 \le i \le m} |\lambda_i - \lambda|\, \|V_k y\|^2 \le \|y\|\, \|\hat v_{k+1}\|\, \|V_k y\|. \qquad (10.27)
$$

Since the columns of $V_k$ are orthonormal, we have $\|V_k y\| = \|y\|$, and simplifying
(10.27), we obtain the first result (10.23). This conclusion can be improved
if we do not apply Cauchy-Schwarz to the term $\langle e_k, y \rangle$ in (10.26). In this case,
we directly obtain (10.24).


10.8 Exercises 


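10.1. What is the spectrum of the following bidiagonal matrix (called a
Wilkinson matrix)?

$$
W(n) = \begin{pmatrix} n & n & 0 & \cdots & 0 \\ 0 & n-1 & n & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ \vdots & & \ddots & 2 & n \\ 0 & \cdots & \cdots & 0 & 1 \end{pmatrix} \in \mathcal{M}_n(\mathbb{R}).
$$

For $n = 20$, compare (using Matlab) the spectrum of $W(n)$ with that of the
matrix $\tilde W(n)$ obtained from $W(n)$ by modifying the single entry in row $n$ and
column 1, $\tilde W(n, 1) = 10^{-10}$. Comment.

10.2. Define

$$
A = \begin{pmatrix} 7.94 & 5.61 & 4.29 \\ 5.61 & -3.28 & -2.97 \\ 4.29 & -2.97 & -2.62 \end{pmatrix}, \quad T = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad\text{and}\quad b = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.
$$

1. Compute the spectrum of $A$ and the solution $x$ of $Ax = b$.

2. Define $A_1 = A + 0.01\,T$. Compute the spectrum of $A_1$ and the solution $x_1$
   of $A_1 x = b$.

3. Explain the results.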


10.3. The goal of this exercise is to study the notion of left eigenvectors for a
matrix $A \in \mathcal{M}_n(\mathbb{C})$.

1. Prove the equivalence $\lambda \in \sigma(A) \iff \bar\lambda \in \sigma(A^*)$.

2. Let $\lambda \in \sigma(A)$ be an eigenvalue of $A$. Show that there exists (at least) one
   nonzero vector $y \in \mathbb{C}^n$ such that $y^* A = \lambda y^*$. Such a vector is called a left
   eigenvector of $A$ associated with the eigenvalue $\lambda$.

3. Prove that the left eigenvectors of a Hermitian matrix are eigenvectors.

4. Let $\lambda$ and $\mu$ be two distinct eigenvalues of $A$. Show that all left eigenvectors
   associated with $\lambda$ are orthogonal to all left eigenvectors associated with
   $\mu$.

5. Use Matlab to compute the left eigenvectors of the matrix

$$
A = \begin{pmatrix} 1 & -2 & -2 & -2 \\ -4 & 0 & -2 & -4 \\ 1 & 2 & 4 & 2 \\ 3 & 1 & 1 & 5 \end{pmatrix}.
$$

10.4. Let $A \in \mathcal{M}_n(\mathbb{C})$ be a diagonalizable matrix, i.e., there exists an
invertible matrix $P$ such that $P^{-1} A P = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$, where the $\lambda_i$ are the
eigenvalues of $A$. Denote by $x_i$ an eigenvector associated to $\lambda_i$, and $y_i$ a left
eigenvector associated to $\lambda_i$ (see Exercise 10.3).


1. Define $Q^* = P^{-1}$. Prove that the columns of $Q$ are left eigenvectors of
   the matrix $A$.
2. Deduce that if the eigenvalue $\lambda_i$ is simple, then $y_i^* x_i \neq 0$.


10.5. We study the conditioning of an eigenvalue.


1. Define a matrix

$$
A = \begin{pmatrix} -97 & 100 & 98 \\ 1 & 2 & -1 \\ -100 & 100 & 101 \end{pmatrix}.
$$

   Determine the spectrum of $A$. We define random perturbations of the
   matrix $A$ by the command B=A+0.01*rand(3,3). Determine the spectra
   of B for several realizations. Which eigenvalues of $A$ have been modified
   the most, and which ones have been the least?
2. The goal is now to understand why some eigenvalues are more sensitive
   than others to variations of the matrix entries. Let $\lambda_0$ be a simple
   eigenvalue of a diagonalizable matrix $A_0$. We denote by $x_0$ (respectively, $y_0$) an
   eigenvector (respectively, a left eigenvector) associated to $\lambda_0$ (we assume
   that $\|x_0\|_2 = \|y_0\|_2 = 1$). For $\varepsilon$ small, we define the matrix $A_\varepsilon = A_0 + \varepsilon E$,
   where $E$ is some given matrix such that $\|E\|_2 = 1$. We define $\lambda_\varepsilon$ to be
   an eigenvalue of $A_\varepsilon$ that is the closest to $\lambda_0$ (it is well defined for $\varepsilon$ small
   enough), and let $x_\varepsilon$ be an eigenvector of $A_\varepsilon$ associated to $\lambda_\varepsilon$.

   (a) Show that the mapping $\varepsilon \mapsto \lambda_\varepsilon$ is continuous and that
       $\lim_{\varepsilon \to 0} \lambda_\varepsilon = \lambda_0$. We admit that $\lim_{\varepsilon \to 0} x_\varepsilon = x_0$.

   (b) We denote by $\delta\lambda = \lambda_\varepsilon - \lambda_0$ the variation of $\lambda$, and by $\delta x = x_\varepsilon - x_0$
       the variation of $x$. Prove that $A_0 (\delta x) + \varepsilon E x_\varepsilon = (\delta\lambda) x_\varepsilon + \lambda_0 (\delta x)$.

   (c) Deduce that the mapping $\lambda : \varepsilon \mapsto \lambda_\varepsilon$ is differentiable and that

$$
\lambda'(0) = \frac{y_0^* E x_0}{y_0^* x_0}.
$$

   (d) Explain why $\operatorname{cond}(A, \lambda_0) = 1 / |y_0^* x_0|$ is called "conditioning of the
       eigenvalue $\lambda_0$."
3. Compute the conditioning of each eigenvalue of the matrix $A$. Explain the
   results observed in question 1.


10.6 (*). Write a function [l,u]=PowerD(A) that computes by the power
method (Algorithm 10.1) the approximations l and u of the largest eigenvalue
(in modulus) and corresponding eigenvector of A. Initialize the algorithm with
a unit vector $x_0$ with equal entries. Test the program with each of the following
symmetric matrices and comment on the results obtained.


0 15 0 9 0 1 2 -3 4 
0 fo 2 0 0 f2 1 4 -3 
2 J> PS] o o 15 oJ? SC=l-3 4 1 2 | 
-2 0 0 0 1 4 -3 2 1 


10.7. Compute by the power method and the deflation technique (see Remark 
10.3.3) the two largest (in modulus) eigenvalues of the following matrix 


Noor 


Modify the function PowerD into a function 1=PowerDef(A,u), where the 
iteration vector x, is orthogonal to a given vector u. 


10.8. Program a function [l,u]=PowerI(A) that implements the inverse
power method (Algorithm 10.2) to compute approximations, l and u, of the
smallest eigenvalue (in modulus) of A and its associated eigenvector. Test the
program on the matrices defined by A=Laplacian1dD(n), for different values
of n.


10.9. Program a function T=HouseholderTri(A) that implements the Householder
algorithm to reduce a symmetric matrix A to a tridiagonal matrix T
(following the proof of Proposition 10.5.1). Check with various examples that


• the matrix T is tridiagonal (write a function for this purpose);
• the spectra of the matrices A and T are the same.


10.10. Program a function Givens(T,i) that computes the eigenvalue $\lambda_i$
(labeled in increasing order) of a symmetric tridiagonal matrix T using the 
Givens method. Test the program by computing the eigenvalues of matrices 
obtained by the instructions 

u=rand(n,1);v=rand(n-1,1) ;T=diag(u)+diag(v,1)+diag(v,-1) 
for various values of n. 


10.11. By gathering the routines of the two previous exercises, program the
Givens-Householder algorithm to compute the eigenvalues of a real symmetric
matrix. Run this program to obtain the eigenvalues of the matrix A defined 
in Exercise 10.7. 


10.12. The goal is to numerically compute the eigenvalues of the matrix 


$$A = \begin{pmatrix} 5 & 3 & 4 & 3 & 3 \\ 3 & 5 & 2 & 3 & 3 \\ 4 & 2 & 4 & 2 & 4 \\ 3 & 3 & 2 & 5 & 3 \\ 3 & 3 & 4 & 3 & 5 \end{pmatrix}$$


by the Lanczos method. The notation is that of Lemma 10.7.1.


1. Let $r_0 = (1,2,3,4,5)^t$. Compute the sequence of vectors $v_j$. Deduce $k_0$,
the critical dimension of the Krylov space associated with A and $r_0$.

2. For $k = k_0 + 1$, compute the matrix $T_k$, as well as its eigenvalues and
eigenvectors. Which eigenvalues of A do you find in the spectrum of $T_k$?

3. Answer the same questions for $r_0 = (1,1,1,1,1)^t$, then for
$r_0 = (1,-1,0,1,-1)^t$.


10.13. Let A=Laplacian2dD(n) be the matrix of the discretized Laplacian on
an $n \times n$ square mesh of the unit square; see Exercise 6.7. The notation is that
of the previous exercise. We fix n = 7.


1. Starting with $r_0 = (1,\ldots,1)^t \in \mathbb{R}^{n^2}$, compute the matrix $T_n$. Plot the
eigenvalues of A (by a symbol) and those of $T_n$ (by another symbol) on
the same graph. Are the eigenvalues of $T_n$ good approximations of some
eigenvalues of A?

2. Same questions starting with the vector $r_0 = (1,2,\ldots,n^2)^t$. Comment on
your observations.


Solution of Exercise 2.1


1. u is the ith vector of the canonical basis of $\mathbb{R}^n$.
2. The scalar product $\langle v,u\rangle = u^t v$ is computed by u'*v (or by the Matlab
function dot).


>> n=10;u=rand(n,1);v=rand(n,1);
>> w=v-u'*v*u/(u'*u);
>> fprintf(' the scalar product <w,u> = %f \n',u'*w)
the scalar product <w,u> = 0.000000


The two vectors are orthogonal. We recognize the Gram-Schmidt
orthonormalization process applied to (u, v).

The scalar product $\langle Cx,x\rangle = x^t C x$ is computed by the Matlab
instruction x'*C*x.

(a) The matrix C being antisymmetric, i.e., $C^t = -C$, we have


$$\langle Cx,x\rangle = \langle x,C^t x\rangle = -\langle x,Cx\rangle = -\langle Cx,x\rangle \implies \langle Cx,x\rangle = 0.$$


(b) Since A = B + C, we have $\langle Ax,x\rangle = \langle Bx,x\rangle$. The matrix B is called
the symmetric part of A, and the matrix C its antisymmetric (or
skew-symmetric) part.


Solution of Exercise 2.2 The following functions are possible (nonunique) 
solutions. 


1. function A=SymmetricMat(n)
A=rand(n,n);
A=A+A';
2. function A=NonsingularMat(n)
A=rand(n,n);
A=A+norm(A,'inf')*eye(size(A)); % make the diagonal of A dominant


224 11 Solutions and Programs 


3. function A=LowNonsingularMat(n)
A=tril(rand(n,n));
A=A+norm(A,'inf')*eye(size(A)); % make the diagonal of A dominant
4. function A=UpNonsingularMat(n)
A=triu(rand(n,n));
A=A+norm(A,'inf')*eye(size(A)); % make the diagonal of A dominant
5. function A=ChanceMat(m,n,p)
% nargin = number of input arguments of the function
switch nargin % number of arguments of the function
case 1
n=m;A=rand(m,n); % The entries of A
case 2 % take values
A=rand(m,n); % between 0 and 1
otherwise
A=rand(m,n);A=p*(2*A-1); % affine transformation
end;


We may call this function with 1, 2, or 3 arguments.
6. function A=BinChanceMat(m,n)
A=rand(m,n);
A(A<0.5)=0;A(A>=0.5)=1;
7. function H=HilbertMat(n,m)
% this function returns a Hilbert matrix
% nargin = number of input arguments of the function
if nargin==1, m=n; end;
H=zeros(n,m);
for i=1:n
for j=1:m
H(i,j)=1/(i+j-1);
end;
end;


For square Hilbert matrices, use the Matlab function hilb. 
Solution of Exercise 2.7 Matrix of fixed rank. 


function A=MatRank(m,n,r)
s=min(m,n);S=max(m,n);
if r>min(m,n)
fprintf('The rank cannot be greater than %i ',s)
error('Error in function MatRank.')
else
A=NonsingularMat(s);
if m>=n
A=[A; ones(S-s,s)]; % m x n matrix of rank r
for k=r+1:s,A(:,k)=rand()*A(:,r);end;
else
A=[A ones(s,S-s)]; % m x n matrix of rank r
for k=r+1:s,A(k,:)=rand()*A(r,:);end;
end;
end


Solution of Exercise 2.10 The Gram-Schmidt algorithm. 


1. To determine the vector $u_p$, we compute (p-1) scalar products of vectors
of size m and (p-1) multiplications of a scalar by a vector (we do not
take into account the test if), that is, 2m(p-1) operations. The total
number of operations is then $\sum_{p=1}^{n} 2m(p-1) \approx mn^2$.

2. function B = GramSchmidt (A) 
pres=1.e-12; 

[m,n]=size(A) ; 
B=zeros(m,n) ; 
for i=1:n 
s=zeros(m,1); 
for k=1:i-1 
s=s+(A(:,i)'*B(:,k))*B(:,k);
end; 
s=A(:,i)-s; 
if norm(s) > pres 
B(:,i) =s/norm(s) ; 
end; 
end; 


For the matrix defined by 


>> n=5;u=1:n; u=u'; c2=cos(2*u); c=cos(u); s=sin(u);
>> A=[u c2 ones(n,1) rand()*c.*c exp(u) s.*s];


we get 


>> U=GramSchmidt (A) 


U =
0.1348 -0.2471 0.7456 0 0.4173 0
0.2697 -0.3683 0.4490 0 -0.4791 0
0.4045 0.8167 0.2561 0 0.1772 0
0.5394 0.0831 -0.0891 0 -0.5860 0
0.6742 -0.3598 -0.4111 0 0.4707 0
(a) We have
>> U'*U
ans =
1.0000 0.0000 0.0000 0 0.0000 0
0.0000 1.0000 -0.0000 0 -0.0000 0
0.0000 -0.0000 1.0000 0 -0.0000 0
0 0 0 0 0 0
0.0000 -0.0000 -0.0000 0 1.0000 0
0 0 0 0 0 0
Explanation: the (nonzero) columns of U being orthonormal, it is clear
that $(U^tU)_{i,j} = \langle u_i,u_j\rangle = \delta_{i,j}$ if $u_i$ and $u_j$ are nonzero. If one of the
vectors is zero, then $(U^tU)_{i,j} = 0$. The situation is different for $UU^t$:
>> U*U'
ans = 

0.8092 0.2622 0.1175 -0.2587 0.0697 

0.2622 0.6395 -0.1616 0.3556 -0.0958 

0.1175 -0.1616 0.9276 0.1594 -0.0430 

-0.2587 0.3556 0.1594 0.6491 0.0945 

0.0697 -0.0958 -0.0430 0.0945 0.9745 
Note: If all columns of U are orthonormal, then U is nonsingular and
$U^t = U^{-1}$. In this case, we also obtain $UU^t = I$ (U is an orthogonal
matrix).

(b) The algorithm is stable in the sense that applied to an orthogonal 
matrix U, it provides the same matrix, up to a small remainder term. 
>> V=GramSchmidt (U) ;norm(V-U) 
ans = 

7.8400e-16 
3. The following function answers the question: 


function B = GramSchmidt1(A)
pres=1.e-12;
colnn=0; % points out the current nonzero column
[m,n]=size(A);
B=zeros(m,n);
for i=1:n
s=zeros(m,1);
for k=1:i-1 % we can also write k=1:colnn
s=s+(A(:,i)'*B(:,k))*B(:,k);
end;
s=A(:,i)-s;
if norm(s) > pres
colnn=colnn+1;
B(:,colnn)=s/norm(s);
end;
end;
>> W=GramSchmidt1(A)
W =
0.1348 -0.2471 0.7456 0.4173 0 0
0.2697 -0.3683 0.4490 -0.4791 0 0
0.4045 0.8167 0.2561 0.1772 0 0
0.5394 0.0831 -0.0891 -0.5860 0 0
0.6742 -0.3598 -0.4111 0.4707 0 0


Solution of Exercise 2.11 The modified Gram-Schmidt algorithm.


1. For $p = 1 \nearrow n$
     If $\|a_p\| \neq 0$ then
       $u_p = a_p/\|a_p\|$
     Otherwise
       $u_p = 0$
     End
     For $k = p+1 \nearrow n$
       $a_k = a_k - \langle a_k, u_p\rangle u_p$
     End
   End
Modified Gram-Schmidt algorithm


Note that both algorithms have the same algorithmic complexity. Here is 
a Matlab programming of the modified Gram-Schmidt algorithm. 


function B = MGramSchmidt(A)
pres=1.e-12;
[m,n]=size(A);
B=zeros(m,n);
for i=1:n
s=A(:,i);
if norm(s) > pres
B(:,i)=s/norm(s);
for k=i+1:n
A(:,k)=A(:,k)-(A(:,k)'*B(:,i))*B(:,i);
end;
else
error('linearly dependent vectors')
end;
end;


2. Comparison of the two algorithms. For a randomly chosen matrix (rand), 
we do not observe a noteworthy difference between the two algorithms. 
For a Hilbert matrix, we get 


>> n=10;A=hilb(n);
>> U=GramSchmidt1(A); V=MGramSchmidt(A); I=eye(n,n);
>> norm(U'*U-I), norm(V'*V-I)
ans = 
2.9969 
ans = 
2.3033e-04 


We remark on this example that the modified Gram-Schmidt algorithm 
is more accurate than the standard Gram-Schmidt algorithm. 
3. We improve the Gram-Schmidt algorithm by iterating it several times:
>> U=GramSchmidt1(A);norm(U'*U-I)
ans =
2.9969
>> U=GramSchmidt1(U);norm(U'*U-I)
ans =
1.3377e-07
>> U=GramSchmidt1(U);norm(U'*U-I)
ans =
3.4155e-16


Solution of Exercise 2.12 Warning: the running times obtained depend on 
what computer is used. However, their ordering on each computer is always 
the same. 


>> t1,t2
t1 =
1.6910
t2 =
0.7932


Explanation: it is clearly more efficient to declare beforehand the (large-sized)
matrices by initializing them. This prevents Matlab from resizing the matrix
at each creation of a new entry $A_{i,j}$.


>> t2,t3 

t2 = 
0.7932 

t3 = 
0.7887 


Explanation: since matrices are stored column by column, it is slightly
better to "span" a matrix in this order and not row by row, since the entries
$A_{i,j}$ and $A_{i+1,j}$ are stored in neighboring "slots" in the computer's memory
(for $1 \le i \le n-1$). In this way, we reduce the access time to these slots.


>> t3,t4 

t3 = 
0.7887 

t4 = 
0.0097 


Explanation: a tremendous speedup is obtained by using a “vectorizable” 
definition of the matrices. Efficient Matlab programs will always be written 
this way. 


Solution of Exercise 2.18 The following script produces Figure 11.1. 


>> A=[10 2; 2 4];
>> t=0:0.1:2*pi;x=cos(t)';y=sin(t)';
>> x=[x;x(1)];y=[y;y(1)];
>> for i=1:length(x)
>> z(i)=[x(i),y(i)]*A*[x(i);y(i)];
>> end;
>> plot3(x,y,z,x,y,zeros(size(x)),'MarkerSize',10,'LineWidth',3);
>> box


Fig. 11.1. Computation of the Rayleigh quotient on the unit circle.


We deduce from Figure 11.1 that the minimal value is close to 3.4 and the 
maximal value is close to 10.6. The spectrum of A is 


>> eig(A) 

ans = 
3.3944 
10.6056 


The eigenvalues are therefore very close to the minimal and maximal values
of $x^t A x$ on the unit circle. Since A is symmetric, we know from Theorem 2.6.1
and Remark 2.6.1 that the minimum of the Rayleigh quotient is equal to the
minimal eigenvalue of A and its maximum is equal to the maximal eigenvalue
of A.
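
The comparison read off the figure can also be made numerically. The following lines are a minimal sketch under our own choices (a finer sampling of the unit circle and a vectorized evaluation); the variable names are not those of the script above.

```matlab
% Sketch: sample x'*A*x on the unit circle and compare with eig(A).
A = [10 2; 2 4];
t = linspace(0, 2*pi, 2001);
X = [cos(t); sin(t)];            % each column is a unit vector
z = sum(X .* (A*X), 1);          % z(k) = X(:,k)'*A*X(:,k)
lam = sort(eig(A));              % eigenvalues in increasing order
fprintf('min = %.4f, max = %.4f\n', min(z), max(z));
fprintf('eig = %.4f, %.4f\n', lam(1), lam(2));
```

With this sampling the extreme values of the Rayleigh quotient agree with the two eigenvalues to about four decimal places.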


Solution of Exercise 2.20 Let us comment on the main instructions in the 
definition of A: 

A=PdSMat(n) % symmetric matrix
[P,D]=eig(A);D=abs(D); % diagonalization: PDP^{-1}=A
D=D+norm(D)*eye(size(D)) % the eigenvalues of D are > 0
A=P*D*inv(P) % positive definite symmetric matrix


1. We always have det(A) > 0, because the determinant of a matrix is equal
to the product of its eigenvalues, which are positive here.
2. We easily check that all the main subdeterminants are strictly positive.
For $x \in \mathbb{C}^k$ a nonzero vector, we denote by $\tilde{x}$ the vector in $\mathbb{C}^n$ obtained
by adding n - k zero entries to x. We have


$$x^* A_k x = \tilde{x}^* A \tilde{x} > 0,$$


since A is positive definite. Hence $A_k$ is positive definite, and its
determinant is positive by the argument of question 1.
3. The spectrum of $A_k$ is not (always) included in that of A, as illustrated
by the following example:


>> n=5;A=PdSMat (n) ; Ak=A(1:3,1:3); 
>> eig(A) 
ans = 
10.7565 
5.5563 
6.6192 
6.0613 
6.2220 
>> eig(Ak) 
ans = 
9.0230 
5.6343 
6.1977 


Solution of Exercise 2.23 We fix n = 5. For A=rand(n,n), we have the 
following results: 


1. >> sp=eig(A)
sp =
2.7208
-0.5774 + 0.1228i
-0.5774 - 0.1228i
0.3839 + 0.1570i
0.3839 - 0.1570i
2. The command sum(abs(X),2) returns a column vector whose entry i is
equal to the sum of the absolute values of the entries of row i of matrix X.
(a) >> Gamma=sum(abs (A) ,2)-diag(abs (A) ) 
Gamma = 
2.5085 
3.5037 
1.2874 
1.8653 
2.1864 
(b) n=length(sp);
for lambda=sp.' % loop over the eigenvalues (sp is a column vector)
ItIsTrue=0;
for k=1:n % n = length(Gamma)
if abs(lambda-A(k,k))<=Gamma(k)
ItIsTrue=1; break;
end;
end;
if ~ItIsTrue
fprintf('Error with the eigenvalue %f\n',lambda);
end;
end;


Thus the spectrum seems to be included in the union of the Gershgorin 
disks. 

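We prove the previous remark, known as the Gershgorin-Hadamard
theorem. Let $\lambda \in \sigma(A)$ be an eigenvalue of matrix A and $u\ (\neq 0) \in \mathbb{C}^n$
a corresponding eigenvector. We have


$$(\lambda - a_{i,i})\, u_i = \sum_{j\neq i} a_{i,j} u_j.$$


In particular, if i is such that $|u_i| = \max_j |u_j|$, we have


$$|(\lambda - a_{i,i})\, u_i| \le \sum_{j\neq i} |a_{i,j}|\,|u_j|,$$
$$|\lambda - a_{i,i}| \le \sum_{j\neq i} |a_{i,j}|\,\frac{|u_j|}{|u_i|},$$
$$|\lambda - a_{i,i}| \le \sum_{j\neq i} |a_{i,j}| = \gamma_i.$$


We deduce that any $\lambda \in \sigma(A)$ belongs to at least one disk $D_i$, and so
$\sigma(A) \subset \bigcup_i D_i$.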
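
The inclusion just proved can be checked in a vectorized way. This is a minimal sketch with our own variable names; it assumes a Matlab release with implicit expansion (R2016b or later), and the small tolerance added to the radii is our choice to absorb rounding errors.

```matlab
% Sketch: check that every eigenvalue lies in at least one Gershgorin disk.
n = 5;
A = rand(n, n);                              % illustrative random matrix
sp = eig(A);                                 % column vector of eigenvalues
c = diag(A);                                 % centers a_ii of the disks
gamma = sum(abs(A), 2) - abs(c);             % radii gamma_i
dist = abs(sp.' - c);                        % dist(i,k) = |sp(k) - a_ii|
inDisk = any(dist <= gamma + 10*eps*norm(A, 'inf'), 1);
fprintf('%d of %d eigenvalues lie in the union of the disks\n', sum(inDisk), n);
```

Each column of dist corresponds to one eigenvalue, so inDisk(k) is true as soon as the kth eigenvalue belongs to one of the n disks.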

function A=DiagDomMat(n,dom)
% returns a square strictly diagonally dominant matrix
% dom determines the extent of the dominance of the diagonal
A=rand(n,n);
% nargin = number of input arguments of the function
if nargin==1, dom=1; end; % default value
A=A-diag(diag(A));
A=A+diag(sum(abs(A),2))+dom*eye(size(A))

We note that the matrix A=DiagDomMat(n) is always nonsingular.
Assume that A is singular. Then 0 is an eigenvalue of A, and by the
previous calculation, there would exist an index i such that $0 \in D_i$,
i.e.,
$$|a_{i,i}| \le \sum_{j\neq i} |a_{i,j}| = \gamma_i,$$
which contradicts the fact that matrix A is strictly diagonally dominant. Thus
A is nonsingular.
4. function PlotGersh(a)
% plots the Gershgorin-Hadamard circles
[m,n]=size(a);
if m~=n, error('the matrix is not square'), end;
d=diag(a);
radii=sum(abs(a),2);
radii=radii-abs(d);
% define a large rectangle containing all the circles
% determine the "lower-left" corner
cornerx=real(diag(a))-radii;cornery=imag(diag(a))-radii;
mx=min(cornerx);my=min(cornery);mx=round(mx-1);my=round(my-1);
% determine the "upper-right" corner
cornerx=real(diag(a))+radii;cornery=imag(diag(a))+radii;
Mx=max(cornerx);My=max(cornery);Mx=round(Mx+1);My=round(My+1);
% specify eigenvalues by symbol +
eigA=eig(a);
plot(real(eigA),imag(eigA),'+','MarkerSize',10,'LineWidth',3)
axis([mx Mx my My]);
set(gca,'XTick',-5:2:10,'YTick',-5:2:5,'FontSize',24);
grid on;
%
Theta=linspace(0,2*pi);
for i=1:n
X=real(a(i,i))+radii(i)*cos(Theta);
Y=imag(a(i,i))+radii(i)*sin(Theta);
hold on;plot(X,Y) % circle i
end
The eigenvalues of $A$ and $A^t$ being the same, we have at our disposal
two bounds on the spectrum of A by applying the Gershgorin-Hadamard
theorem to both matrices. We display in Figure 11.2 the Gershgorin circles
of $A$ (left) and those of $A^t$ (right). Of course, we obtain a better bound
of the spectrum of A by superposing the two figures.


Fig. 11.2. Gershgorin disks for a matrix (left) and its transpose (right).


11.1 Exercises of Chapter 2 233 


Solution of Exercise 2.29 Another definition of the pseudoinverse matrix. 


1. >> m=10,n=7;A=MatRank(m,n,5);
>> % determination of P
>> X=orth(A'); % note: Im(A^t) = (Ker A)^perp
>> P=X*X';

2. >> X=orth(A); % image of A
>> Q=X*X';


3. We note that 


>> norm(pinv(A)*A-P) 
ans = 

2.7981e-14 
>> norm(A*pinv(A)-Q) 
ans = 

1.5443e-14 


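Let us show that indeed $P = A^\dagger A$ and $Q = AA^\dagger$.

(a) We first check that $X = A^\dagger A$ is an orthogonal projection, i.e., that
$X = X^*$ and $X^2 = X$. This is a consequence of the Moore-Penrose
relations

i. $X^* = (A^\dagger A)^* = A^\dagger A = X$,

ii. $X^2 = (A^\dagger A)(A^\dagger A) = (A^\dagger A A^\dagger) A = A^\dagger A = X$.
To prove that $P = A^\dagger A$, it remains to show that $\mathrm{Im}\, X = (\mathrm{Ker}\, A)^\perp$.
We know that $(\mathrm{Ker}\, A)^\perp = \mathrm{Im}(A^*)$. Let us prove that $\mathrm{Im}\, X = \mathrm{Im}(A^*)$.
• For any $y \in \mathrm{Im}(A^\dagger A)$, there exists $x \in \mathbb{C}^n$ such that


$y = A^\dagger A x = (A^\dagger A)^* x = A^* (A^\dagger)^* x \implies y \in \mathrm{Im}\, A^*$.
• For any $y \in \mathrm{Im}(A^*)$, there exists $x \in \mathbb{C}^m$ such that $y = A^* x$, so
$y = (A A^\dagger A)^* x = (A^\dagger A)^* A^* x = A^\dagger A A^* x \implies y \in \mathrm{Im}(A^\dagger A)$.


(b) Similarly, we now check that $Y = AA^\dagger$ is an orthogonal projection:
i. $Y^* = (AA^\dagger)^* = AA^\dagger = Y$,
ii. $Y^2 = (AA^\dagger)(AA^\dagger) = (AA^\dagger A)A^\dagger = AA^\dagger = Y$.
It remains to show that $\mathrm{Im}\, Y = \mathrm{Im}\, A$.
• $y \in \mathrm{Im}(AA^\dagger) \implies \exists x \in \mathbb{C}^m,\ y = AA^\dagger x \implies y \in \mathrm{Im}\, A$.
• $y \in \mathrm{Im}\, A \implies \exists x \in \mathbb{C}^n,\ y = Ax = AA^\dagger A x \implies y \in \mathrm{Im}(AA^\dagger)$.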

4. First, there exists at least one x such that $Ax = Qy$, since $Q = AA^\dagger$,
and thus the existence of at least one $x_1 = Px$ is obvious. Let us prove
the uniqueness of $x_1$. For $x$ and $\tilde{x}$ in $\mathbb{C}^n$ such that $Qy = Ax = A\tilde{x}$, since
$P = A^\dagger A$, we have


$$Ax = A\tilde{x} \iff x - \tilde{x} \in \mathrm{Ker}\, A \implies x - \tilde{x} \in \mathrm{Ker}\, P \iff Px = P\tilde{x}.$$


By definition, $Ax = Qy = AA^\dagger y$, and multiplying by $A^\dagger$ and using the
relation $A^\dagger = A^\dagger A A^\dagger$, we deduce that $A^\dagger A x = A^\dagger y$. Since $A^\dagger A = P$, we
conclude that $A^\dagger y = Px = x_1$, which is the desired result.


11.2 Exercises of Chapter 3


Solution of Exercise 3.3 We observe that all these norms coincide for
a diagonal matrix A, and are equal to $\max_i |a_{i,i}|$. Let us prove this result.
We start with the case $p = \infty$. Let x be such that $\|x\|_\infty = 1$; we have
$\|Ax\|_\infty = \max_i |A_{i,i} x_i|$, so that $\|A\|_\infty \le \max_i |A_{i,i}|$. Let I be an index for
which $|A_{I,I}| = \max_i |A_{i,i}|$. We have $\|Ae_I\|_\infty = \max_i |A_{i,i}|$, which implies that
$\|A\|_\infty = \max_i |A_{i,i}|$.

For $p \in [1, \infty)$ and x such that $\|x\|_p = 1$, we have $\|Dx\|_p^p = \sum_i |D_{i,i} x_i|^p$,
whence $\|Dx\|_p^p \le (\max_i |D_{i,i}|^p)\, \|x\|_p^p$, and so $\|D\|_p \le \max_i |D_{i,i}|$. We end the
proof as in the case $p = \infty$.
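
A quick numerical illustration of this result, given here as a minimal sketch: the diagonal entries are an arbitrary choice, and only the values p = 1, 2, ∞ supported by the Matlab function norm for matrices are tested.

```matlab
% Sketch: the subordinate 1-, 2- and infinity-norms of a diagonal matrix.
d = [3; -7; 0.5; 2];            % illustrative diagonal entries
D = diag(d);
norms = [norm(D, 1), norm(D, 2), norm(D, inf)];
disp(norms)                     % each entry is max(abs(d)) = 7, up to rounding
```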


Solution of Exercise 3.7 


1. We recall the bounds


$$\lambda_{\min}\|x\|_2^2 \le \langle Ax,x\rangle \le \lambda_{\max}\|x\|_2^2,$$
from which we easily infer the result.
2. Plot of $S_A$ for n = 2.


(a) $\langle Ax,x\rangle = \left\langle A\binom{x_1}{p x_1},\binom{x_1}{p x_1}\right\rangle = |x_1|^2 A_p$, setting $A_p = \left\langle A\binom{1}{p},\binom{1}{p}\right\rangle$.


Then, $x = (x_1, x_2)^t$ belongs to the intersection of $\Gamma_p$ and $S_A$ if and
only if $|x_1| = 1/\sqrt{A_p}$ and $x_2 = p x_1$.
(b) We rotate the half-line $x_2 = p x_1$ around the origin and determine for
each value of p the intersection of $\Gamma_p$ and $S_A$.
function [x,y]=UnitCircle(A,n)
% add a test to check that the matrix
% is symmetric positive definite
x=[];y=[];h=2*pi/n;
for alpha=0:h:2*pi
p=tan(alpha);
ap=[1 p]*A*[1;p];
if cos(alpha)>0
x=[x 1/sqrt(ap)];
y=[y p/sqrt(ap)];
else
x=[x -1/sqrt(ap)];
y=[y -p/sqrt(ap)];
end;
end;


(c) >> n=100;A=[7 5; 5 7];
>> [x1,y1]=UnitCircle(A,n);
>> [x2,y2]=UnitCircle(eye(2,2),n);
>> plot(x1,y1,x2,y2,'.','MarkerSize',10,'LineWidth',3)
>> axis([-1.2 1.2 -1.2 1.2]);grid on; axis equal;
>> set(gca,'XTick',-1:.5:1,'YTick',-1:.5:1,'FontSize',24);


Fig. 11.3. Unit circles for the Euclidean norm and the norm induced by the matrix A;
see (3.12).


In view of Figure 11.3, it seems that $S_A$ is an ellipse. Let us prove
that it is indeed so. Let A be a symmetric positive definite matrix of
size $n \times n$. It is diagonalizable in an orthonormal basis of eigenvectors,
$A = PDP^t$:


$$\langle Ax,x\rangle = 1 \iff \langle DP^t x, P^t x\rangle = 1 \iff \sum_{i=1}^n \lambda_i y_i^2 = 1,$$


where $y = P^t x$ and $\lambda_i$ are the eigenvalues of A. The last equation may
be written

$$\sum_{i=1}^n \left(\frac{y_i}{1/\sqrt{\lambda_i}}\right)^2 = 1,$$


since the eigenvalues are positive. We recognize the equation, in the
coordinates $y_i$ of the basis of eigenvectors, of an ellipsoid of semiaxes $1/\sqrt{\lambda_i}$.
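
The parametrization obtained in this proof can be verified directly. The sketch below is ours (variable names included): it builds points of the ellipse from the eigendecomposition of the matrix A = [7 5; 5 7] used above and checks that they satisfy ⟨Ax, x⟩ = 1.

```matlab
% Sketch: points of the set <Ax,x> = 1 built from the eigendecomposition.
A = [7 5; 5 7];
[P, D] = eig(A);                       % A = P*D*P', P orthogonal
lambda = diag(D);                      % eigenvalues of A
semiaxes = 1 ./ sqrt(lambda);          % semiaxes 1/sqrt(lambda_i)
t = linspace(0, 2*pi, 200);
Y = [semiaxes(1)*cos(t); semiaxes(2)*sin(t)];  % ellipse in the eigenvector basis
X = P * Y;                             % back to the canonical basis
q = sum(X .* (A*X), 1);                % q(k) = <A x_k, x_k>
fprintf('max |<Ax,x> - 1| = %.2e\n', max(abs(q - 1)));
```

Plotting X(1,:) against X(2,:) should reproduce the curve returned by UnitCircle(A,n), up to the sampling of the angle.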

The results of the following instructions (with n = 100) are shown in 
Figure 11.4. 

>> [x1,y1]=UnitCircle(A,n);
>> [x2,y2]=UnitCircle(B,n);
>> [x3,y3]=UnitCircle(C,n);
>> plot(x1,y1,x2,y2,'.',x3,y3,'+',...
>> 'MarkerSize',10,'LineWidth',3)
>> grid on; axis equal;
>> set(gca,'XTick',-.8:.4:.8,'YTick',-.8:.4:.8,...
>> 'FontSize',24);

The most elongated ellipse corresponds to the matrix C, and the least
elongated one to A. Explanation (see previous question): the three
matrices have a common first eigenvalue, equal to 12. The second
eigenvalue is equal to 2 for A, 1 for B, and 1/2 for C.


Fig. 11.4. Unit circles for the norms defined by matrices defined in (3.12).


Solution of Exercise 3.8 We fix the dimension n. The execution of the 
following instructions, 


>> A=rand(n,n); s=eigs(A,1)
>> for i=1:10
>> k=i*10;
>> abs((norm(A^k))^(1/k)-s)
>> end


shows that $\|A^k\|^{1/k}$ seemingly tends to $\varrho(A)$, whatever the chosen norm. Let
us prove that it is actually the case. Consider a square matrix A and a matrix
norm $\|\cdot\|$.

1. Since $\lambda \in \sigma(A) \implies \lambda^k \in \sigma(A^k)$, we have $|\lambda|^k \le \varrho(A^k) \le \|A^k\|$, and in
particular, $\varrho(A)^k \le \varrho(A^k) \le \|A^k\|$, so $\varrho(A) \le \|A^k\|^{1/k}$.

2. By construction of the matrix $A_\varepsilon$, we indeed have $\varrho(A_\varepsilon) < 1$ and thus
$\lim_{k\to+\infty} \|A_\varepsilon^k\| = 0$. We deduce that there exists an index $k_0$ such that for
all $k \ge k_0$, we have $\|A_\varepsilon^k\| \le 1$, that is, $\varrho(A) + \varepsilon \ge \|A^k\|^{1/k}$.

3. Conclusion: for all $\varepsilon > 0$, there exists $k_0$ such that $k \ge k_0$ implies
that $\varrho(A) \le \|A^k\|^{1/k} \le \varrho(A) + \varepsilon$, and therefore


$$\varrho(A) = \lim_{k\to+\infty} \|A^k\|^{1/k}.$$


11.3 Exercises of Chapter 4


Solution of Exercise 4.1


1. The computation of a scalar product $\langle u,v\rangle = \sum_{i=1}^n u_i v_i$ is carried out in
n multiplications (and n - 1 additions, but we do not take into account
additions in operation counts). The computation of $\|u\|_2 = \sqrt{\langle u,u\rangle}$ is
carried out in n multiplications and one square root extraction (which is
negligible when n is large). Computing the rank-one matrix $uv^t$ requires
$n^2$ operations, since $(uv^t)_{i,j} = u_i v_j$.

2. We denote by $a_i$ the rows of A. Each entry $(Au)_i$ being equal to the scalar
product $\langle a_i, u\rangle$, the total cost of the computation of Au is $n^2$ operations.
We call $b_j$ the columns of matrix B. Each element $(AB)_{i,j}$ being equal to
the scalar product $\langle a_i, b_j\rangle$, the total cost for computing AB is $n^3$
operations.
3. The result of the instructions below is displayed in Figure 11.5. 


>> n1=500;in=1:5;
>> for k=in;
>> d=k*n1;a=rand(d,d);b=rand(d,d);
>> tic;x=a*b;time(k)=toc;
>> end
>> n=n1*in';plot(n,time,'-+','MarkerSize',10,'LineWidth',3)
>> text(2200,.5,'n','FontSize',24);
>> text(600,5.2,'T(n)','FontSize',24)


Fig. 11.5. Running time for computing a product of two matrices.


4. The assumption $T(n) \approx Cn^s$ defines an affine relation between logarithms

$$\ln T(n) \approx s\ln n + D, \qquad (11.1)$$


with D = ln C. We display in Figure 11.6 the result of the following 
instructions: 

>> x=log(n);y=log(time);
>> plot(x,y,'-+','MarkerSize',10,'LineWidth',3)
>> text(7.5,-3.5,'log(n)','FontSize',24);
>> text(6.2,1.5,'log(T(n))','FontSize',24)


Fig. 11.6. Running time for computing a product of two matrices (log-log scale).


The curve is close to a line of slope s:


>> s=(y(5)-y(1))/(x(5)-x(1))
s =
2.9234


The practical running time is close to the theoretical computational time. 
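
The slope above uses only the first and last measurements. A least-squares fit of relation (11.1) uses all of them. The sketch below illustrates this with the Matlab function polyfit on synthetic timings; the constant C and the sizes are made-up values, not measurements.

```matlab
% Sketch: estimate s and D in ln T(n) = s*ln(n) + D by least squares.
n = 500 * (1:5)';                 % matrix sizes (illustrative)
C = 4e-10;                        % made-up constant
T = C * n.^3;                     % synthetic timings following T(n) = C*n^3
p = polyfit(log(n), log(T), 1);   % p(1) = slope s, p(2) = intercept D
s = p(1);
D = p(2);
fprintf('s = %.4f, C = %.3e\n', s, exp(D));
```

With measured timings in place of T, the fitted slope is less sensitive to a single noisy measurement than the two-point estimate.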
Solution of Exercise 4.2 


1. Since the matrix A is lower triangular, we have


$$c_{i,j} = \sum_{k=1}^{n} a_{i,k} b_{k,j} = \sum_{k=1}^{i} a_{i,k} b_{k,j}.$$


So $c_{i,j}$ is computed in i multiplications; moreover, there are n elements to
be computed per row i. The total cost is


$$\sum_{i=1}^{n} n\,i = n\,\frac{n(n+1)}{2} \approx \frac{n^3}{2}.$$

2. For $i = 1 \nearrow n$          computing row i of matrix C
     For $j = 1 \nearrow n$
       $s = 0$
       For $k = 1 \nearrow i$
         $s = s + a_{i,k} b_{k,j}$
       End For k
       $c_{i,j} = s$
     End For j
   End For i


3. If both matrices A and B are lower triangular, so is the product C = AB.
We have, for $j \le i$,

$$c_{i,j} = \sum_{k=j}^{i} a_{i,k} b_{k,j}.$$


Hence, $c_{i,j}$ is computed in $i - j + 1$ multiplications. Furthermore, there
are only i elements left to be computed per row i. The total cost is


$$\sum_{i=1}^{n} \sum_{j=1}^{i} (i - j + 1) = \sum_{i=1}^{n} \sum_{j=1}^{i} j = \sum_{i=1}^{n} \frac{i(i+1)}{2} \approx \frac{n^3}{6}.$$
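
The operation count can be confirmed by counting the multiplications explicitly. This is a minimal sketch of ours; the size n is arbitrary, and the closed form n(n+1)(n+2)/6 is the exact value of the last sum.

```matlab
% Sketch: count the multiplications in the product of two lower
% triangular matrices and compare with the closed-form sum.
n = 20;                               % illustrative size
count = 0;
for i = 1:n
    for j = 1:i
        for k = j:i
            count = count + 1;        % one multiplication a(i,k)*b(k,j)
        end
    end
end
exact = n*(n+1)*(n+2)/6;              % sum over i of i*(i+1)/2
fprintf('count = %d, exact = %d, n^3/6 = %.1f\n', count, exact, n^3/6);
```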


4. The function LowTriMatMult defined below turns out to be slower than the 
usual matrix computation by Matlab (which does not take into account 
the triangular character of the matrices). 


function C=LowTriMatMult(A,B)
% Multiplication of two lower triangular matrices
[m,n]=size(A);[n1,p]=size(B);
if n~=n1
error('Wrong dimensions of the matrices')
end;
C=zeros(m,p);
for i=1:m
for j=1:i
s=0;
for k=j:i
s=s+A(i,k)*B(k,j);
end;
C(i,j)=s;
end;
end;


This is not a surprise, since the above function LowTriMatMult is not
vectorized and optimized as are standard Matlab operations, such as the
product of two matrices. In order to see the computational gain in taking
into account the triangular structure of matrices, one has to compare
LowTriMatMult with a function programmed in Matlab language that
executes the product of two matrices without vectorization.

5. Now we compare LowTriMatMult with the function MatMult defined below:


function C=MatMult(A,B)
% Multiplication of two matrices
% tests whether the dimensions are compatible
% as in function LowTriMatMult
[m,n]=size(A);p=size(B,2);
C=zeros(m,p);
for i=1:m
for j=1:p
s=0;
for k=1:n
s=s+A(i,k)*B(k,j);
end;
C(i,j)=s;
end;
end;


The comparison gives the advantage to LowTriMatMult: the speedup is
approximately a factor 6:


>> n=1000;a=tril(rand(n,n));b=tril(rand(n,n));
>> tic;c=MatMult(a,b);t1=toc;
>> tic;d=LowTriMatMult(a,b);t2=toc;
>> t1, t2
t1 =
23.5513
t2 =
4.1414


6. With the instruction sparse, Matlab takes into account the sparse
structure of the matrix:


>> n=300;
>> a=triu(rand(n,n));b=triu(rand(n,n));
>> tic;a*b;t1=toc
>> sa=sparse(a);sb=sparse(b);
>> tic;sa*sb;t2=toc
t1 =
0.0161
t2 =
0.0291


There is no gain in computational time. Actually, the advantage of sparse
lies in the storage:
>> n=300;a=triu(rand(n,n));spa=sparse(a);
>> whos a
Name Size Bytes Class
a 300x300 720000 double array
Grand total is 90000 elements using 720000 bytes
>> whos spa
Name Size Bytes Class
spa 300x300 543004 double array (sparse)
Grand total is 45150 elements using 543004 bytes


Solution of Exercise 5.2


1. function x=ForwSub(A,b)
% Computes the solution of system Ax = b
% A is nonsingular lower triangular
[m,n]=size(A);o=length(b);
if m~=n | o~=n, error('dimension problem'), end;
small=1.e-12;
if norm(A-tril(A),'inf')>small
error('non lower triangular matrix')
end;
x=zeros(n,1);
if abs(A(1,1))<small, error('noninvertible matrix'), end;
x(1)=b(1)/A(1,1);
for i=2:n
if abs(A(i,i))<small
error('noninvertible matrix')
end;
x(i)=(b(i)-A(i,1:i-1)*x(1:i-1))/A(i,i);
end;

2. function x=BackSub(A,b)
% Computes the solution of system Ax = b
% A is nonsingular upper triangular
[m,n]=size(A);o=length(b);
if m~=n | o~=n, error('dimension problem'), end;
small=1.e-12;
if norm(A-triu(A),'inf')>small
error('non upper triangular matrix')
end;
x=zeros(n,1);
if abs(A(n,n))<small, error('noninvertible matrix'), end;
x(n)=b(n)/A(n,n);
for i=n-1:-1:1
if abs(A(i,i))<small
error('noninvertible matrix')
end;
x(i)=(b(i)-A(i,i+1:n)*x(i+1:n))/A(i,i);
end;


Solution of Exercise 5.3 We store in a vector u the lower triangular matrix
A row by row, starting with the first one. The mapping between indices is
$A_{i,j} = u_{j+i(i-1)/2}$. Here are the programs:


1. function aL=StoreL()
fprintf('Storage of a lower triangular matrix')
fprintf('the matrix is stored row by row')
n=input('enter the dimension n of the square matrix')
for i=1:n
fprintf('row %i \n',i)
ii=i*(i-1)/2;
for j=1:i
fprintf('enter element (%i,%i) of the matrix',i,j)
aL(j+ii)=input(' ');
end;
end;
2. function y=StoreLpv(a,b)
% a is a lower triangular matrix stored by StoreL
[m,n]=size(b);
if n~=1
error('b is not a vector')
end;
if m*(m+1)/2~=length(a)
error('incompatible dimensions')
end;
for i=1:m
ii=i*(i-1)/2;
s=0;
for j=1:i
s=s+a(j+ii)*b(j);
end;
y(i)=s;
end;
The inner loop could be replaced by the more compact and efficient
instruction


y(i)=a(ii+1:ii+i)*b(1:i);


3. function x=ForwSubL(a,b)
% a is a lower triangular matrix stored by StoreL
% add the compatibility tests of function StoreLpv(a,b)
m=length(b);
for i=1:m
ii=i*(i-1)/2;
s=0;
for j=1:i-1
s=s+a(j+ii)*x(j);
end;
x(i)=(b(i)-s)/a(i+ii); % check if a(i+ii) is zero
end;
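
The index mapping used by these three functions can be tested without keyboard input. The sketch below is ours: it fills the packed vector from an illustrative matrix (built with magic, an arbitrary choice) instead of calling StoreL, then forms a matrix-vector product with the compact instruction mentioned above.

```matlab
% Sketch: row-by-row packed storage of a lower triangular matrix,
% using the index mapping A(i,j) = u(j + i*(i-1)/2) for j <= i.
n = 4;
A = tril(magic(n));                  % illustrative lower triangular matrix
u = zeros(1, n*(n+1)/2);
for i = 1:n
    ii = i*(i-1)/2;                  % number of entries stored before row i
    u(ii+1:ii+i) = A(i, 1:i);        % copy row i of the triangle
end
b = (1:n)';
y = zeros(n, 1);
for i = 1:n
    ii = i*(i-1)/2;
    y(i) = u(ii+1:ii+i) * b(1:i);    % row i of A times b
end
fprintf('error = %.2e\n', norm(y - A*b));
```

Only n(n+1)/2 entries are stored, compared with n^2 for the full matrix.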


Solution of Exercise 5.13 Hager algorithm.
1. Proposition 3.1.2 furnishes a formula for $\|A\|_1$ (at a negligible
computational cost):


$$\|A\|_1 = \max_{1\le j\le n} \left(\sum_{i=1}^{n} |a_{i,j}|\right).$$


2. From the previous formula we deduce that there exists an index $j_0$ such
that

$$\|A^{-1}\|_1 = \sum_{i=1}^{n} |(A^{-1})_{i,j_0}| = \|(A^{-1})_{j_0}\|_1,$$

where $(A^{-1})_{j_0}$ is the $j_0$th column of $A^{-1}$. Hence


$$\|A^{-1}\|_1 = \|A^{-1}e_{j_0}\|_1 = f(e_{j_0}).$$
3. We write


$$f(x) = \sum_{i=1}^{n} |(A^{-1}x)_i| = \sum_{i=1}^{n} |\tilde{x}_i| = \sum_{i=1}^{n} s_i \tilde{x}_i = \langle \tilde{x}, s\rangle.$$


4. We compute $\bar{x}^t(a - x)$:


$$\bar{x}^t(a - x) = (A^{-t}s)^t(a - x) = s^t A^{-1}(a - x) = \langle A^{-1}a, s\rangle - \langle \tilde{x}, s\rangle.$$


According to question 3,


$$f(x) + \bar{x}^t(a - x) = \langle A^{-1}a, s\rangle = \sum_{j=1}^{n} (A^{-1}a)_j s_j.$$


Each term in this sum is bounded from above by $|(A^{-1}a)_j|$; we therefore
deduce

$$f(x) + \bar{x}^t(a - x) \le \sum_{j=1}^{n} |(A^{-1}a)_j| = \|A^{-1}a\|_1 = f(a),$$


from which we get the result.


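5. According to question 4, we have
$$f(e_j) - f(x) \ge \bar{x}^t(e_j - x) = \langle \bar{x}, e_j\rangle - \langle x, \bar{x}\rangle = \bar{x}_j - \langle x, \bar{x}\rangle > 0,$$


which yields the result.

6. (a) According to question 3, we have $f(y) = \langle \tilde{y}, s'\rangle$, where $s'$ is the sign
vector of $\tilde{y}$. However, for y close enough to x, the sign of $\tilde{y} = A^{-1}y$ is
equal to the sign of $\tilde{x}$ (we have assumed that $\tilde{x}_i \neq 0$ for all i). Thus
we have


$$f(y) = \langle \tilde{y}, s\rangle = f(x) + \langle \tilde{y} - \tilde{x}, s\rangle = f(x) + \langle A^{-1}(y - x), s\rangle = f(x) + s^t A^{-1}(y - x),$$


which proves the result.
(b) x is a local maximum of f if and only if for all y close enough to x,
we have $s^t A^{-1}(y - x) \le 0$. We write


$$s^t A^{-1}(y - x) = \langle A^{-1}(y - x), s\rangle = \langle y - x, A^{-t}s\rangle = \langle y - x, \bar{x}\rangle = \langle y, \bar{x}\rangle - \langle x, \bar{x}\rangle.$$
We infer that if $\|\bar{x}\|_\infty \le \langle x, \bar{x}\rangle$, we have
$$s^t A^{-1}(y - x) \le \langle y, \bar{x}\rangle - \|\bar{x}\|_\infty \le (\|y\|_1 - 1)\,\|\bar{x}\|_\infty = 0,$$


for $y \in S$.
7. The Hager algorithm in pseudolanguage:


choose x of norm $\|x\|_1 = 1$
compute $\tilde{x}$ by $A\tilde{x} = x$
compute s by $s_i = \mathrm{sign}(\tilde{x}_i)$
compute $\bar{x}$ by $A^t\bar{x} = s$
While $\|\bar{x}\|_\infty > \langle x, \bar{x}\rangle$
   compute j such that $\bar{x}_j = \|\bar{x}\|_\infty$
   set $x = e_j$
   compute $\tilde{x}$, s and $\bar{x}$
   If $\|\bar{x}\|_\infty \le \langle x, \bar{x}\rangle$
      $\mathrm{cond}_1(A) \approx \|A\|_1 \|\tilde{x}\|_1$
   End If
End While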
8. The Hager algorithm in Matlab. 


function conda=Cond1(a)
% computes an approximation of the 1-norm
% condition number of a square matrix
% by optimization criteria
% initialization
n=size(a,1);
x=ones(n,1)/n; % x_i=1/n
xt=a\x;ksi=sign(xt);xb=a'\ksi;
notgood=norm(xb,'inf')>xb'*x;
conda=norm(xt,1)*norm(a,1);
% loop
while notgood
[maxx,j]=max(abs(xb));
x=zeros(n,1);x(j)=1;
xt=a\x;ksi=sign(xt);xb=a'\ksi;
if norm(xb,'inf')<=xb'*x
conda=norm(xt,1)*norm(a,1);
notgood=0;
else
[maxx,j]=max(abs(xb));
x=zeros(n,1);x(j)=1;
end;
end;
This algorithm gives remarkably precise results:
>> n=5;a=NonsingularMat(n);
>> [Cond1(a), norm(a,1)*norm(inv(a),1)]
ans =
238.2540 238.2540
>> n=10;a=NonsingularMat(n);
>> [Cond1(a), norm(a,1)*norm(inv(a),1)]
ans =
900.5285 900.5285

Solution of Exercise 5.15 Polynomial preconditioner $C^{-1} = p(A)$.
function PA=PrecondP(A,k)
% Polynomial preconditioning
% check the condition ||I-A|| < 1
PA=eye(size(A));
if k~=0
C=eye(size(A))-A;
if max(abs(eig(C)))>=.99 % compute rho(C)
error('look for another preconditioner')
end;
for i=1:k
PA=PA+C^i;
end;
end;


Some examples:
>> n=10;a=PdSMat(n);a=a/(2*norm(a));
>> cond(a)
ans =
1.9565
>> ap=PrecondP(a,1);cond(ap*a)
ans =
1.6824
>> ap=PrecondP(a,10);cond(ap*a)
ans =
1.0400
>> ap=PrecondP(a,20);cond(ap*a)
ans =
1.0020

The larger k is, the better is the conditioning; however, its computational cost 
increases. 


Solution of Exercise 5.16 Finite difference approximation of Laplace equation.


1. Computation of the matrix and right-hand side of (5.19). 
(a) The following function gives the value of the matrix $A_n$.
function A=Laplacian1dD(n)
% computes the 1D Laplacian matrix
% discretized by centered finite differences
% with Dirichlet boundary conditions.
A=zeros(n,n);
for i=1:n-1
  A(i,i)=2;
  A(i,i+1)=-1;
  A(i+1,i)=-1;
end;
A(n,n)=2;
A=A*(n+1)^2;


There are other ways of constructing $A_n$. For instance,
i. Function toeplitz(u), which returns a symmetric Toeplitz matrix
whose first row is the vector u. The instruction
Ah=toeplitz([2, -1, zeros(1, n-2)])*(n+1)^2
then defines $A_n$.

ii. The following instructions also define $A_n$.
u=ones(n-1,1);v=[1;u];a=(n+1)^2;
Ah=2*a.*diag(v,0)-a.*diag(u,-1)-a.*diag(u,1);

iii. The reader can check that the instructions
A=[eye(n,n) zeros(n,1)];A=eye(n,n)-A(:,2:n+1);
A=A+A';A=A*(n+1)^2;
also define the matrix $A_n$.
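
For large n, a sparse assembly avoids storing the zero entries. The following sketch, which is an addition and not one of the constructions above, uses the built-in spdiags and compares the result with the toeplitz variant.

```matlab
n = 6;
e = ones(n,1);
% Sparse tridiagonal assembly: the columns of the first argument
% are placed on the diagonals -1, 0 and 1
As = spdiags([-e 2*e -e], -1:1, n, n)*(n+1)^2;
% Dense reference built with toeplitz
Ad = toeplitz([2, -1, zeros(1,n-2)])*(n+1)^2;
difference = norm(full(As) - Ad, 1);
fprintf('difference between the two constructions: %g\n', difference);
```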
(b) The following function gives the value of the right-hand side $b_n$.
function v=InitRHS(n)
x=1:n;h=1./(n+1);
x=h*x';v=f(x);
It calls a function f that corresponds to the right-hand side f(x) of
the differential equation.
2. Validation. 
(a) When $f(x) \equiv 1$, $u_e(x) = x(1 - x)/2$ is the solution of problem (5.18).
function [exa,sm]=Check(n)
x=1:n;h=1./(n+1);x=h*x';
exa=x.*(1-x)/2;
sm=ones(n,1);
Examples:
>> n=10;Ah=Laplacian1dD(n);
>> [exasol,sm]=Check(n);sol=Ah\sm;
>> norm(sol-exasol)
ans =
   2.6304e-16
We note that no error was made; the solution is computed exactly, up to
machine accuracy. This was predictable, since the
discretization of $u''$ performed in (5.19) is exact for polynomials of
degree less than or equal to 3.
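
The exactness claim follows from a Taylor expansion, sketched here under the assumption that u is four times continuously differentiable (the points $\xi_\pm$ lie between $x$ and $x \pm h$):

```latex
\begin{aligned}
u(x \pm h) &= u(x) \pm h\,u'(x) + \frac{h^2}{2}\,u''(x) \pm \frac{h^3}{6}\,u'''(x)
              + \frac{h^4}{24}\,u^{(4)}(\xi_\pm), \\
\frac{-u(x-h) + 2u(x) - u(x+h)}{h^2} &= -u''(x)
              - \frac{h^2}{24}\left(u^{(4)}(\xi_+) + u^{(4)}(\xi_-)\right).
\end{aligned}
```

The odd-order terms cancel, and the remainder vanishes when $u^{(4)} \equiv 0$, that is, for polynomials of degree at most 3.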
(b) Convergence of the method. We use here the function 
function [exa,sm]=Check2(n)
x=1:n;h=1./(n+1);x=h*x';
exa=(x-1).*sin(10*x);
sm=-20*cos(10*x)+100*(x-1).*sin(10*x);
and the approximation error is given by the script
for i=1:10
  n=10*i;[exasol,sm]=Check2(n);
  Ah=Laplacian1dD(n);sol=Ah\sm;
  y(i)=log(norm(sol-exasol,'inf'));
  x(i)=log(n);
end
plot(x,y,'.-','MarkerSize',20,'LineWidth',3)
grid on;
set(gca,'XTick',2:1:5,'YTick',-7.5:1:-3,'FontSize',24);
In view of Figure 11.7, the logarithm of the error is an affine function
of the logarithm of n; the slope of this straight line is about $-2$. Thus,
we deduce that $\|sol - exasol\|_\infty \approx C n^{-2}$, where C is a constant, which was predicted
by Theorem 1.1.1.
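
Instead of reading the slope off the plot, one can fit it by least squares. The script below is a self-contained sketch: it rebuilds the matrix with toeplitz rather than calling Laplacian1dD, and the use of polyfit is an addition to the solution.

```matlab
m = 10;
logn = zeros(m,1); logerr = zeros(m,1);
for i = 1:m
  n = 10*i; h = 1/(n+1); x = h*(1:n)';
  Ah = toeplitz([2, -1, zeros(1,n-2)])*(n+1)^2;   % same matrix as Laplacian1dD(n)
  exasol = (x-1).*sin(10*x);                      % exact solution used in Check2
  sm = -20*cos(10*x) + 100*(x-1).*sin(10*x);      % right-hand side -u''
  sol = Ah\sm;
  logn(i) = log(n);
  logerr(i) = log(norm(sol-exasol,inf));
end
p = polyfit(logn, logerr, 1);                     % p(1) is the fitted slope
fprintf('fitted slope = %.3f\n', p(1));
```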
Fig. 11.7. Logarithm of the error in terms of logarithm of n (log-log scale).

3. Eigenvalues of the matrix $A_n$.
(a) Solving the differential equation $u'' + \lambda u = 0$ shows that the
eigenvalues of the operator $u \mapsto -u''$ endowed with homogeneous Dirichlet
boundary conditions are exactly $\lambda_k = k^2\pi^2$, for $k > 0$, with
corresponding eigenfunctions $u_k(x) = \sin(k\pi x)$. When n tends to $+\infty$
(i.e., h tends to 0) and for a fixed k, we obtain the limit

$$\lambda_{h,k} = \frac{4}{h^2}\sin^2\left(\frac{k\pi h}{2}\right) \to k^2\pi^2 = \lambda_k.$$

In other words, the "first" eigenvalues of $A_n$ are close to the eigenvalues
of the continuous operator; see Figure 11.8.

n=20;x=(1:n)';h=1./(n+1);
Ahev=(1-cos(pi.*h.*x)).*2./h./h;
contev=pi.*pi.*x.*x;
plot(x,contev,x,Ahev,'.-','MarkerSize',20,'LineWidth',3)
(b) The function eig(X) returns a vector containing the eigenvalues of
matrix X:
n=500;Ah=Laplacian1dD(n);
x=(1:n)';h=1./(n+1);
exacev=(1-cos(pi.*h.*x)).*2./h./h;
matlabev=eig(Ah);
plot(x,matlabev-exacev)
The eigenvalues of A, are accurately computed by Matlab: for exam- 
ple, for n = 500 the maximal error committed is less than 107°. 
(c) We plot on a log-log scale the condition number of $A_n$:
for i=1:7
  n=50*i;Ah=Laplacian1dD(n);
  en(i)=log(n);condn(i)=log(cond(Ah));
end
plot(en,condn)

Fig. 11.8. Comparison of the eigenvalues of the continuous and discrete problems
for n = 20 (left) and n = 100 (right).


We observe that the 2-norm conditioning of $A_n$ behaves as $C n^2$, where C is a constant.
This result agrees with the theory, since, $A_n$ being symmetric, its
2-norm conditioning is equal to the ratio of its extremal eigenvalues

$$\mathrm{cond}_2(A_n) = \frac{\max_k |\lambda_{h,k}|}{\min_k |\lambda_{h,k}|} \approx \frac{4}{\pi^2}\, n^2.$$
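
This ratio can be checked numerically. The sketch below assumes the eigenvalue formula $\lambda_{h,k} = \frac{4}{h^2}\sin^2(k\pi h/2)$, from which the exact ratio is $\cot^2(\pi h/2)$, and compares it with the value returned by cond and with the asymptotic value (written here with n+1 in place of n, which is equivalent for large n).

```matlab
n = 50; h = 1/(n+1);
Ah = toeplitz([2, -1, zeros(1,n-2)])*(n+1)^2;   % same matrix as Laplacian1dD(n)
c_num   = cond(Ah);                             % 2-norm condition number
c_exact = cot(pi*h/2)^2;                        % lambda_{h,n}/lambda_{h,1}
c_asym  = 4*(n+1)^2/pi^2;                       % leading-order behavior
fprintf('%g  %g  %g\n', c_num, c_exact, c_asym);
```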


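We have, of course, $\tilde A_n = A_n + cI_n$, and the eigenvalues of $\tilde A_n$ are
$\tilde\lambda_{h,k} = \lambda_{h,k} + c$. In particular, the first eigenvalue is

$$\tilde\lambda_{h,1} = \frac{4}{h^2}\sin^2\left(\frac{\pi h}{2}\right) + c.$$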


We solve the linear system for n = 100, f = 1.
>> n=100;Ah=Laplacian1dD(n);
>> s=eig(Ah);c=-s(1);       % Warning: it may be necessary
>> At=Ah+c*eye(n,n);        % to sort out vector s with
>> b=InitRHS(n);sol=At\b;   % command sort

To check the reliability of the result we compute the norm of the
solution sol given by Matlab:

>> norm(sol)
ans =
   1.0470e+12

which is a clear indication that something is going wrong. Since $-c$ is
an eigenvalue of $A_n$, the matrix $\tilde A_n$ is singular. Because of rounding
errors, Matlab nevertheless finds a solution without any warning.
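
A more direct diagnostic than the norm of the solution is the reciprocal condition estimate returned by the built-in rcond. The script below is a sketch of such a check; the threshold 100*eps is an arbitrary choice made for this illustration.

```matlab
n = 100;
Ah = toeplitz([2, -1, zeros(1,n-2)])*(n+1)^2;   % same matrix as Laplacian1dD(n)
s = sort(eig(Ah)); c = -s(1);                   % shift by the smallest eigenvalue
At = Ah + c*eye(n);
rc = rcond(At);                                 % close to 0 for a singular matrix
if rc < 100*eps
  fprintf('matrix is numerically singular: rcond = %g\n', rc);
end
```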
11.5 Exercises of Chapter 6


Solution of Exercise 6.3 The function Gauss(A,b) computes the solution
of the linear system Ax = b by the Gauss method; the pivot in step k is the first
nonzero entry in the set $(A_{j,k})_{j \ge k}$. In the function GaussWithoutPivot there
is no pivot strategy; the program stops if the entry $A_{k,k}$ is too small. Partial
pivoting and complete pivoting are used in the functions GaussPartialPivot
and GaussCompletePivot.


1. function x=Gauss(A,b)
% solve system Ax = b by
% the Gauss method; the pivot is the first nonzero entry of the column
[m,n]=size(A);o=length(b);
if m~=n | o~=n, error('dimension problem'), end;
% Initialization
small=1.e-16;
for k=1:n-1
  % Search for the pivot (stop at the last row)
  u=A(k,k:n);pivot=A(k,k);i0=k;
  while abs(pivot)<small && i0<n
    i0=i0+1;pivot=A(i0,k);
  end;
  if abs(pivot)<small
    error('singular matrix')
  end;
  % Exchange rows for A and b
  if i0~=k
    u=A(i0,k:n);A(i0,k:n)=A(k,k:n);A(k,k:n)=u;
    s=b(i0);b(i0)=b(k);b(k)=s;
  end
  for j=k+1:n
    s=A(j,k)/pivot;v=A(j,k:n);
    A(j,k:n)=v-s*u;b(j)=b(j)-s*b(k);
  end;
end;
% A = A_n is an upper triangular matrix
% we solve A_n x = b_n by back substitution
x=zeros(n,1);
if abs(A(n,n))>=small
  x(n)=b(n)/A(n,n);
else
  error('singular matrix')
end;
for i=n-1:-1:1
  x(i)=(b(i)-A(i,i+1:n)*x(i+1:n))/A(i,i);
end;
2. function x=GaussWithoutPivot(A,b)
% solve system Ax = b by
% the Gauss method without pivoting
[m,n]=size(A);o=length(b);
if m~=n | o~=n, error('dimension problem'), end;
% Initialization
small=1.e-16;
for k=1:n-1
  u=A(k,k:n);pivot=A(k,k);i0=k;
  if abs(pivot)<small, error('stop: zero pivot'), end;
  for j=k+1:n
    s=A(j,k)/pivot;v=A(j,k:n);
    A(j,k:n)=v-s*u;b(j)=b(j)-s*b(k);
  end;
end;
% A = A_n is an upper triangular matrix
% we solve A_n x = b_n by the back substitution method
x=zeros(n,1);
if abs(A(n,n))>=small
  x(n)=b(n)/A(n,n);
else
  error('singular matrix')
end;
for i=n-1:-1:1
  x(i)=(b(i)-A(i,i+1:n)*x(i+1:n))/A(i,i);
end;
3. function x=GaussPartialPivot(A,b)
% solve system Ax = b by
% the partial pivoting Gauss method
[m,n]=size(A);o=length(b);
if m~=n | o~=n, error('dimension problem'), end;
% Initialization
small=1.e-16;
for k=1:n-1
  B=A(k:n,k);
  % We determine i0
  [pivot,index]=max(abs(B));
  if abs(pivot)<small
    error('singular matrix')
  end;
  i0=k-1+index(1,1);
  % Exchange rows for A and b
  if i0~=k
    u=A(i0,k:n);A(i0,k:n)=A(k,k:n);A(k,k:n)=u;
    s=b(i0);b(i0)=b(k);b(k)=s;
  end
  % We carry out the Gauss elimination
  u=A(k,k:n);pivot=A(k,k);
  for j=k+1:n
    s=A(j,k)/pivot;v=A(j,k:n);
    A(j,k:n)=v-s*u;b(j)=b(j)-s*b(k);
  end;
end;
% A = A_n is an upper triangular matrix
% we solve A_n x = b_n by the back substitution method
x=zeros(n,1);
if abs(A(n,n))>=small
  x(n)=b(n)/A(n,n);
else
  error('singular matrix')
end;
for i=n-1:-1:1
  x(i)=(b(i)-A(i,i+1:n)*x(i+1:n))/A(i,i);
end;
4. function x=GaussCompletePivot(A,b)
% solve system Ax = b by
% the complete pivoting Gauss method
[m,n]=size(A);o=length(b);
if m~=n | o~=n, error('dimension problem'), end;
% Initialization
small=1.e-8;
ix=1:n;
for k=1:n-1
  B=A(k:n,k:n);
  % We determine i0, j0
  [P,I]=max(abs(B));
  [p,index]=max(P);pivot=B(I(index),index);
  if abs(pivot)<small, error('singular matrix'), end;
  i0=k-1+I(index);j0=k-1+index;
  % Exchange rows for A and b
  if i0~=k
    u=A(i0,k:n);A(i0,k:n)=A(k,k:n);A(k,k:n)=u;
    s=b(i0);b(i0)=b(k);b(k)=s;
  end
  % Exchange columns of A and rows of x
  if j0~=k
    u=A(:,k);A(:,k)=A(:,j0);A(:,j0)=u;
    s=ix(j0);ix(j0)=ix(k);ix(k)=s;
  end
  % We carry out the Gauss elimination
  u=A(k,k:n);pivot=A(k,k);
  for j=k+1:n
    s=A(j,k)/pivot;
    v=A(j,k:n);
    A(j,k:n)=v-s*u;
    b(j)=b(j)-s*b(k);
  end;
end;
% A = A_n is an upper triangular matrix
% we solve A_n x_n = b_n by the back substitution method
y=zeros(n,1);
if abs(A(n,n))>=small
  y(n)=b(n)/A(n,n);
else
  error('singular matrix')
end;
for i=n-1:-1:1
  y(i)=(b(i)-A(i,i+1:n)*y(i+1:n))/A(i,i);
end;
% we rearrange the entries of x
x=zeros(n,1);x(ix)=y;
5. (a) Dividing by a small pivot yields bad numerical results because of
rounding errors:
>> e=1.E-15;
>> A=[e 1 1;1 1 -1;1 1 2];x=[1 -1 1]';b=A*x;
>> norm(Gauss(A,b)-x)
ans =
   7.9928e-04
>> norm(GaussPartialPivot(A,b)-x)
ans =
   2.2204e-16
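
The mechanism is easiest to see on a 2 x 2 system. The script below is a sketch on a system chosen for this illustration (it is not part of the exercise): the same elimination is carried out on the rows in their original order and after exchanging them.

```matlab
e = 1.e-20;
A = [e 1; 1 2]; b = [1; 3];          % exact solution is close to (1,1)
orders = [1 2; 2 1];                 % row 1: no pivoting, row 2: rows exchanged
X = zeros(2,2);
for t = 1:2
  p = orders(t,:);
  B = A(p,:); c = b(p);
  m = B(2,1)/B(1,1);                 % multiplier of the elimination step
  u22 = B(2,2) - m*B(1,2);
  c2  = c(2) - m*c(1);
  X(2,t) = c2/u22;                   % back substitution
  X(1,t) = (c(1) - B(1,2)*X(2,t))/B(1,1);
end
disp(X)   % column 1: without pivoting, column 2: with partial pivoting
```

Without pivoting the multiplier 1/e is so large that the entries 2 and 3 of the second row are lost in the rounding, and the first component of the computed solution is 0 instead of 1.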
(b) Now we compare the strategies complete pivoting/partial pivoting. 

i. We modify the headings of the programs by adding a new output
argument. For instance, we now define the function Gauss,
function [x,g]=Gauss(A,b). The rate g is computed by adding
the instruction g=max(max(abs(A)))/a0 just after the computation
of the triangular matrix. The variable a0=max(max(abs(A)))
is computed at the beginning of the program.

ii. For diagonally dominant matrices, we note that the rates are all
close to 1.
>> n=40;b=rand(n,1);A=DiagDomMat(n);
>> [x,gwp]=GaussWithoutPivot(A,b);[x,g]=Gauss(A,b);
>> [x,gpp]=GaussPartialPivot(A,b);
>> [x,gcp]=GaussCompletePivot(A,b);
>> [gwp, g, gpp, gcp]
ans =
    0.9900    0.9900    0.9900    1.0000


The same holds true for positive definite symmetric matrices.
>> n=40;b=rand(n,1);A=PdSMat(n);
>> [x,gwp]=GaussWithoutPivot(A,b);[x,g]=Gauss(A,b);
>> [x,gpp]=GaussPartialPivot(A,b);
>> [x,gcp]=GaussCompletePivot(A,b);
>> [gwp, g, gpp, gcp]
ans =
    0.9969    0.9969    0.9969    1.0000


We can therefore apply, without bad surprises, the Gauss method
to these matrix classes. For "random" matrices, the result is different:
>> n=40;b=rand(n,1);A=rand(n,n);
>> [x,gwp]=GaussWithoutPivot(A,b);[x,g]=Gauss(A,b);
>> [x,gpp]=GaussPartialPivot(A,b);
>> [x,gcp]=GaussCompletePivot(A,b);
>> [gwp, g, gpp, gcp]
ans =
  128.3570  128.3570    3.0585    1.8927
For these matrices, it is better to use the Gauss methods with
partial or complete pivoting.

iii. Comparison of $\varrho_{GPP}$ and $\varrho_{GCP}$.

A. The following instructions produce Figure 11.9, which clearly
shows that complete pivoting is more stable. The drawback
of this method is its slow speed, since it performs $(n - k)^2$
comparisons (or logical tests) at each step of the algorithm,
whereas partial pivoting requires only $n - k$ comparisons.

>> for k=1:10
>> n=10*k;b=ones(n,1);
>> A=n*NonsingularMat(n);
>> [x g]=GaussPartialPivot(A,b);Pp1(k)=g;
>> [x g]=GaussCompletePivot(A,b);Cp1(k)=g;
>> A=n*NonsingularMat(n);
>> [x g]=GaussPartialPivot(A,b);Pp2(k)=g;
>> [x g]=GaussCompletePivot(A,b);Cp2(k)=g;
>> A=n*NonsingularMat(n);
>> [x g]=GaussPartialPivot(A,b);Pp3(k)=g;
>> [x g]=GaussCompletePivot(A,b);Cp3(k)=g;
>> end;
>> n=10*(1:10);
>> plot(n,Pp1,'+',n,Pp2,'+',n,Pp3,'+',n,Cp1,...
>> 'x',n,Cp2,'x',n,Cp3,'x','MarkerSize',10,...
>> 'LineWidth',3)
>> set(gca,'XTick',0:20:100,'YTick',0:3:6,...
>> 'FontSize',24);

Fig. 11.9. Growth rate, in terms of n, for several runs of Gaussian elimination with
partial pivoting (+) and complete pivoting (x).


B. The following matrix has been cooked up so that $\varrho_{GPP} = 2^{n-1}$,
while, as for any matrix, $\varrho_{GCP}$ does not grow much more
quickly than n. However, in usual practice, the partial pivoting
strategy is as efficient as the complete pivoting Gauss method.
>> for k=1:5
>> n=2*k;b=ones(n,1);
>> A=-tril(ones(n,n))+2*diag(ones(n,1));
>> A=A+[zeros(n,n-1) [ones(n-1,1);0]];
>> [x g]=GaussPartialPivot(A,b);Pp(k)=g;
>> [x g]=GaussCompletePivot(A,b);Cp(k)=g;
>> end;
>> dim=2*(1:5);
>> [dim;Pp;Cp]
ans =
     2     4     6     8    10
     2     8    32   128   512
     2     2     2     2     2
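
The doubling can be reproduced without the functions of this exercise. The following self-contained sketch performs elimination with partial pivoting inline on the same matrix for n = 10 and measures the growth rate; the variable names are chosen for this illustration.

```matlab
n = 10;
A = -tril(ones(n,n)) + 2*diag(ones(n,1));
A = A + [zeros(n,n-1) [ones(n-1,1);0]];
a0 = max(abs(A(:)));
U = A;
for k = 1:n-1
  [pv, idx] = max(abs(U(k:n,k)));      % partial pivoting: largest entry of column k
  i0 = k-1+idx;
  U([k i0],:) = U([i0 k],:);           % row exchange (no effect when i0 == k)
  for j = k+1:n
    U(j,k:n) = U(j,k:n) - (U(j,k)/U(k,k))*U(k,k:n);
  end
end
g = max(abs(U(:)))/a0;
fprintf('growth rate for n = %d: %g\n', n, g);
```

At each step every candidate pivot has modulus 1, so no row is exchanged, and each elimination step doubles the entries of the last column.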


Solution of Exercise 6.5 Storage of a band matrix. 


1. function aB=StoreB(p)
fprintf('Storage of a triangular band matrix\n')
fprintf('of bandwidth 2*p+1\n')
fprintf('the matrix is stored row by row \n')
n=input('enter the dimension n of the square matrix ')
for i=1:n
  fprintf('row %i \n',i)
  ip=(2*i-1)*p;
  for j=max(1,i-p):min(n,i+p)
    fprintf('enter entry (%i,%i) of the matrix',i,j)
    aB(j+ip)=input(' ');
  end;
end;
2. function aBb=StoreBpv(a,p,b)
% a is a band matrix
% stored by StoreB
% Warning: execute all compatibility tests
% here we assume that a and b are compatible
[m,n]=size(b);aBb=zeros(m,1);
for i=1:m
  ip=(2*i-1)*p;
  s=0;
  for j=max(1,i-p):min(m,i+p)
    s=s+a(j+ip)*b(j);
  end;
  aBb(i)=s;
end;
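
The storage convention used by both functions places entry (i,j) of the band at position j+(2i-1)p of the vector. The script below is a non-interactive sketch of this convention: it packs a tridiagonal matrix (p = 1) and checks the band matrix-vector product against the dense product. The preallocated length n(2p+1) is an assumed upper bound.

```matlab
n = 6; p = 1;
A = toeplitz([2, -1, zeros(1,n-2)]);     % tridiagonal: half-bandwidth p = 1
aB = zeros(1, n*(2*p+1));                % assumed upper bound on the storage
for i = 1:n
  ip = (2*i-1)*p;
  for j = max(1,i-p):min(n,i+p)
    aB(j+ip) = A(i,j);                   % entry (i,j) goes to position j+(2i-1)p
  end
end
b = (1:n)';
y = zeros(n,1);
for i = 1:n
  ip = (2*i-1)*p;
  for j = max(1,i-p):min(n,i+p)
    y(i) = y(i) + aB(j+ip)*b(j);         % same loop as in StoreBpv
  end
end
fprintf('error of the band product: %g\n', norm(y - A*b));
```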


Solution of Exercise 6.6 


function aB=LUBand(aB,p)
% Compute the LU factorization of
% a band matrix A with half-bandwidth p.
% A (in aB), L and U (also in aB)
% are in the form computed by program StoreB
n=round(length(aB)/(2*p+1));zero=1.e-16;
for k=1:n-1
  if abs(aB((2*k-1)*p+k))<zero, error('error : zero pivot'),end;
  Ind=min(n,k+p);
  for i=(k+1):Ind
    aB((2*i-1)*p+k)=aB((2*i-1)*p+k)/aB((2*k-1)*p+k);
    IndR=(k+1):Ind;
    aB((2*i-1)*p+IndR)=aB((2*i-1)*p+IndR)- ...
      aB((2*k-1)*p+IndR)*aB((2*i-1)*p+k);
  end;
end;
11.6 Exercises of Chapter 7

Solution of Exercise 7.5 The following function computes the QR factorization
of a square matrix by the Householder algorithm.


function [Q,R]=Householder(A)
% QR decomposition computed by
% the Householder algorithm
% Matrix A is square
[m,n]=size(A);
R=A;Q=eye(size(A));
for k=1:m-1
  i=k-1;j=n-i;v=R(k:n,k);w=v+norm(v)*[1;zeros(j-1,1)];
  Hw=House(w);
  Hk=[eye(i,i) zeros(i,j); zeros(j,i) Hw];
  Q=Hk*Q;
  R(k:n,k:n)=Hw*R(k:n,k:n);
end;
Q=Q';
The function House computes the Householder matrix of a vector. 


function H=House(v)
% Elementary Householder matrix
[n,m]=size(v);
if m~=1
  error('enter a vector')
else
  H=eye(n,n);
  n=norm(v);
  if n>1.e-10
    H=H-2*v*v'/n/n;
  end;
end;


We check the results.

>> n=20;A=rand(n,n);[Q,R]=Householder(A);
% we check that QR = A
>> norm(Q*R-A)
ans =
   3.3120e-15
% we check that Q is unitary
>> norm(Q*Q'-eye(size(A))), norm(Q'*Q-eye(size(A)))
ans =
   1.8994e-15
ans =
   1.8193e-15
% we check that R is upper triangular
>> norm(R-triu(R))
ans =
   4.3890e-16
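
The building block of the algorithm can also be checked on its own. The sketch below takes a vector chosen for this illustration, forms $w = v + \|v\| e_1$ as in the function Householder, and verifies that the corresponding reflection maps v onto $-\|v\| e_1$.

```matlab
v = [3; 4; 0; 12];                         % hypothetical vector, norm(v) = 13
w = v + norm(v)*[1; zeros(3,1)];
H = eye(4) - 2*(w*w')/(w'*w);              % elementary Householder matrix of w
Hv = H*v;
disp(Hv')                                  % all entries but the first vanish
```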


Finally, we compare the results with the modified Gram-Schmidt method of 
Exercise 2.11. 


>> n=10;A=HilbertMat(n);B=MGramSchmidt(A);norm(B*B'-eye(size(A)))
ans =
   2.3033e-04
>> [Q,R]=Householder(A);norm(Q*Q'-eye(size(A)))
ans =
   1.0050e-15

For larger values of n, we have

>> n=20;A=HilbertMat(n);B=MGramSchmidt(A);norm(B*B'-eye(size(A)))
?? Error using ==> MGramSchmidt
linearly dependent vectors
>> [Q,R]=Householder(A);norm(Q*Q'-eye(size(A)))
ans =
   1.6434e-15


The Householder method is more robust and provides an orthogonal ma- 
trix, whereas the Gram-Schmidt algorithm finds zero vectors (more precisely 
of norm less than 107”). 


11.7 Exercises of Chapter 8 


Solution of Exercise 8.6 Program for the relaxation method. 


function [x,iter]=Relax(A,b,w,tol,MaxIter,x)
% Computes by the relaxation method the solution of system Ax=b
% w = relaxation parameter
% tol = epsilon of the termination criterion
% MaxIter = maximum number of iterations
% x = x_0 initial vector
[m,n]=size(A);
if m~=n, error('the matrix is not square'), end;
if abs(det(A)) < 1.e-12
  error('the matrix is singular')
end;
% nargin = number of input arguments of the function
% Default values of the arguments
if nargin==5 , x=zeros(size(b));end;
if nargin==4 , x=zeros(size(b));MaxIter=200;end;
if nargin==3 , x=zeros(size(b));MaxIter=200;tol=1.e-4;end;
if nargin==2 , x=zeros(size(b));MaxIter=200;tol=1.e-4;w=1;end;
% w is tested after the default values, so that it is always defined
if ~w, error('omega = zero');end;
M=diag((1-w)*diag(A)/w)+tril(A);
% Initialization
iter=0;r=b-A*x;
% Iterations
while (norm(r)>tol)&(iter<MaxIter)
  y=M\r;
  x=x+y;
  r=r-A*y;
  iter=iter+1;
end;

Fig. 11.10. Relaxation method: number of iterations in terms of $\omega$.


We run the program for $\omega \in (0.1, 2)$.
>> n=20;A=Laplacian1dD(n);b=sin((1:n)/(n+1))';sol=A\b;
>> pas=0.1;
>> for i=1:20
>> omega(i)=i*pas;
>> [x,iter]=Relax(A,b,omega(i),1.e-6,1000,zeros(size(b)));
>> itera(i)=iter;
>> end;
>> plot(omega,itera,'-+','MarkerSize',10,'LineWidth',3)

In view of the results in Figure 11.10 (left), it turns out that the optimal
parameter is between 1.7 and 1.9. Hence, we zoom in on this region.

pas=0.01;
for i=1:20
  omega(i)=1.7+i*pas;
  [x,iter]=Relax(A,b,omega(i),1.e-6,1000,zeros(size(b)));
  itera(i)=iter;
end;
plot(omega,itera,'-+','MarkerSize',10,'LineWidth',3)


The optimal parameter seems to be close to 1.75 in Figure 11.10 (right). Since
A is tridiagonal, symmetric, and positive definite, Theorem 8.3.2 gives the
optimal parameter

$$\omega_{opt} = \frac{2}{1 + \sqrt{1 - \rho(J)^2}}.$$

Let us compute this value with Matlab

>> D=diag(diag(A));J=eye(size(A))-inv(D)*A;rhoJ=max(abs(eig(J)))
rhoJ =
    0.9888
>> wopt=2/(1+sqrt(1-rhoJ^2))
wopt =
    1.7406


which is indeed close to 1.75. 
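
For this particular matrix the value can also be obtained in closed form. The sketch below assumes the known spectrum of the Jacobi matrix of the 1D Laplacian, whose eigenvalues are $\cos(k\pi h)$ with $h = 1/(n+1)$, so that $\rho(J) = \cos(\pi h)$ and $\omega_{opt} = 2/(1 + \sin(\pi h))$.

```matlab
n = 20; h = 1/(n+1);
rhoJ = cos(pi*h);                     % spectral radius of the Jacobi matrix
wopt = 2/(1+sqrt(1-rhoJ^2));          % formula of the theorem
wopt_closed = 2/(1+sin(pi*h));        % same value, since 1 - cos^2 = sin^2
fprintf('rhoJ = %.4f, wopt = %.4f\n', rhoJ, wopt);
```

As n grows, $\omega_{opt}$ tends to 2, which is consistent with the optimum moving toward the right end of the interval in Figure 11.10.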


11.8 Exercises of Chapter 9 


Solution of Exercise 9.3 


1. function [x,iter]=GradientC(A,b,tol,MaxIter,x)
% Computes by the conjugate gradient method
% the solution of system Ax = b
% tol = epsilon of the termination criterion
% MaxIter = maximal number of iterations
% x = x_0 initial vector
% nargin = number of input arguments of the function
% Default values of the arguments
if nargin==4 , x=zeros(size(b));end;
if nargin==3 , x=zeros(size(b));MaxIter=2000;end;
if nargin==2 , x=zeros(size(b));MaxIter=2000;tol=1.e-4;end;
% Initialization
iter=0;r=b-A*x;tol2=tol*tol;normr2=r'*r;p=r;
% Iterations
while (normr2>tol2)&(iter<MaxIter)
  Ap=A*p;
  alpha=normr2/(p'*Ap);
  x=x+alpha*p;
  r=r-alpha*Ap;
  beta=r'*r/normr2;
  p=r+beta*p;
  normr2=r'*r;
  iter=iter+1;
end;
2. The number of required iterations is always much smaller for the conjugate
gradient method (GradientC) than for the gradient method (GradientV).

>> n=5;A=Laplacian1dD(n);xx=(1:n)'/(n+1);b=xx.*sin(xx);
>> [xVG,iterVG]=GradientV(A,b,1.e-4,10000);
>> [xCG,iterCG]=GradientC(A,b,1.e-4,10000);
>> [iterVG, iterCG]
ans =
    60     5
>> n=10;A=Laplacian1dD(n);xx=(1:n)'/(n+1);b=xx.*sin(xx);
>> [xVG,iterVG]=GradientV(A,b,1.e-4,10000);
>> [xCG,iterCG]=GradientC(A,b,1.e-4,10000);
>> [iterVG, iterCG]
ans =
   220    10
>> n=20;A=Laplacian1dD(n);xx=(1:n)'/(n+1);b=xx.*sin(xx);
>> [xVG,iterVG]=GradientV(A,b,1.e-4,10000);
>> [xCG,iterCG]=GradientC(A,b,1.e-4,10000);
>> [iterVG, iterCG]
ans =
   848    20
>> n=30;A=Laplacian1dD(n);xx=(1:n)'/(n+1);b=xx.*sin(xx);
>> [xVG,iterVG]=GradientV(A,b,1.e-4,10000);
>> [xCG,iterCG]=GradientC(A,b,1.e-4,10000);
>> [iterVG, iterCG]
ans =
  1902    30
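
In each run the conjugate gradient count equals n, in line with the fact that, in exact arithmetic, the method terminates in at most n iterations. The script below is a self-contained sketch of this property for n = 10; it inlines the iteration of GradientC and rebuilds the matrix with toeplitz.

```matlab
n = 10;
A = toeplitz([2, -1, zeros(1,n-2)])*(n+1)^2;   % same matrix as Laplacian1dD(n)
xx = (1:n)'/(n+1); b = xx.*sin(xx);
x = zeros(n,1); r = b; p = r; normr2 = r'*r;
for iter = 1:n                      % at most n steps in exact arithmetic
  Ap = A*p;
  alpha = normr2/(p'*Ap);
  x = x + alpha*p;
  r = r - alpha*Ap;
  beta = (r'*r)/normr2;
  p = r + beta*p;
  normr2 = r'*r;
end
relres = norm(b - A*x)/norm(b);
fprintf('relative residual after %d iterations: %g\n', n, relres);
```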


3. >> n=5;A=toeplitz(n:-1:1)/12;b=(1:n)';
>> x0=[-2,0,0,0,10]';
>> [xCG,iter0]=GradientC(A,b,1.e-10,50,x0);
>> x1=[-1,6,12,0,17]';
>> [xCG,iter1]=GradientC(A,b,1.e-10,50,x1);
>> [iter0 iter1]
ans =
     3     5
For the initial guess $x_0$, the conjugate gradient algorithm has converged
in three iterations. Let us check that the Krylov critical dimension
corresponding to the residual $b - Ax_0$ is equal to 2.

>> r=b-A*x0;
>> X=[];x=r;
>> for k=1:n
>> X=[X x];[k-1 rank(X)]
>> x=A*x;
>> end;
ans =
     0     1
ans =
     1     2
ans =
     2     3
ans =
     3     3
ans =
     4     3

For the second initial guess, the Krylov critical dimension turns out to be 4.

4. System Ax = b is equivalent to $A^tAx = A^tb$, whose matrix $A^tA$ is symmetric,
positive definite when A is nonsingular. The conjugate gradient algorithm can be applied
to the latter system. However, observe that computing $A^tA$ costs $n^3/2$
operations, which is very expensive, and therefore it will not be done if
we want to minimize the computational time. Remember that the explicit
form of $A^tA$ is not required: it is enough to know how to multiply a vector
by this matrix.
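
The last remark can be sketched as follows: the product $A^tAp$ is evaluated as $A^t(Ap)$, that is, two matrix-vector products per iteration, and $A^tA$ is never formed. The test matrix and right-hand side are assumptions made for this illustration.

```matlab
A = [4 1 0; 2 5 1; 0 3 6];          % hypothetical nonsymmetric, nonsingular matrix
b = [1; 2; 3];
x = zeros(3,1);
r = A'*(b - A*x);                   % residual of the system A'*A*x = A'*b
p = r; normr2 = r'*r;
for iter = 1:3                      % n = 3 steps suffice in exact arithmetic
  Ap = A*p;
  AtAp = A'*Ap;                     % A'*A*p without forming A'*A
  alpha = normr2/(p'*AtAp);
  x = x + alpha*p;
  r = r - alpha*AtAp;
  beta = (r'*r)/normr2;
  p = r + beta*p;
  normr2 = r'*r;
end
err = norm(x - A\b);
fprintf('distance to the solution of Ax = b: %g\n', err);
```

Note that the condition number of $A^tA$ is the square of that of A, so convergence can be slow for ill-conditioned matrices.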


11.9 Exercises of Chapter 10 


Solution of Exercise 10.6 Here is a program for the power method.

function [l,u]=PowerD(A)
% Computes by the power method
% l = approximation of |lambda_n|
% u = a corresponding eigenvector
% Initialization
n=size(A,1);x0=ones(n,1)/sqrt(n); % x_0
converge=0;eps=1.e-6;
iter=0;MaxIter=100;
% beginning of iterations
while (iter<MaxIter)&(~converge)
  u=A*x0;
  x=u/norm(u);
  converge=norm(x-x0)<eps;
  x0=x;iter=iter+1;
end
l=norm(u);


Application to the three suggested matrices: 
• For matrix A, everything seems to work as expected. The algorithm returns
the maximal eigenvalue, which is simple.

>> A=[2 2 1 0;2 0 0 0;1 0 0 2;0 0 2 -2];
>> [l,u]=PowerD(A);fprintf('l =%f \n',l)
l =3.502384
>> disp(eig(A)')
   -3.3063   -1.2776    1.0815    3.5024

• For matrix B, we also get the maximal eigenvalue, although it is double;
see Remark 10.3.1.

>> B=[15 0 9 0;0 24 0 0;9 0 15 0;0 0 0 16];
>> [l,u]=PowerD(B);
>> fprintf('l =%f \n',l)
l =24.000000
>> disp(eig(B)')
     6    16    24    24

• For matrix C, the algorithm does not compute the eigenvalue of maximal
modulus, even though all eigenvalues are simple.

>> C=[1 2 -3 4;2 1 4 -3;-3 4 1 2;4 -3 2 1];
>> [l,u]=PowerD(C);
>> fprintf('l =%f \n',l)
l =4.000000
>> disp(eig(C)')
   -8.0000    2.0000    4.0000    6.0000
Explanation: let us calculate the eigenvectors of matrix C, then compare
them with the initial data of the algorithm $x_0 = (\frac12, \frac12, \frac12, \frac12)^t$.
>> [P,X]=eig(C); P, X
P =
    0.5000   -0.5000   -0.5000   -0.5000
   -0.5000   -0.5000   -0.5000    0.5000
    0.5000    0.5000   -0.5000    0.5000
   -0.5000    0.5000   -0.5000   -0.5000
X =
   -8.0000         0         0         0
         0    2.0000         0         0
         0         0    4.0000         0
         0         0         0    6.0000
We see that $x_0$ is an eigenvector of C corresponding to the eigenvalue
$\lambda = 4$. In this case, it is easy to see that the sequence of approximated
eigenvectors, generated by the power method, is stationary (equal to $x_0$).
The sequence of eigenvalues is stationary too.
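
A different starting vector avoids the problem. The sketch below is a variant and not the function PowerD: it starts from the first canonical basis vector, which has a component along every eigenvector of C, runs a fixed number of iterations, and uses the Rayleigh quotient to recover the sign of the eigenvalue. The test norm(x-x0)<eps of PowerD is not reused here, because for a negative dominant eigenvalue the normalized iterates change sign at each step.

```matlab
C = [1 2 -3 4; 2 1 4 -3; -3 4 1 2; 4 -3 2 1];
x = [1; 0; 0; 0];                     % not an eigenvector of C
for iter = 1:200
  u = C*x;
  x = u/norm(u);                      % normalized power iteration
end
l = x'*C*x;                           % Rayleigh quotient: keeps the sign
fprintf('l =%f \n', l);
```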